How to Tackle Real-Time Video Processing
Contributed By DigiKey's North American Editors
2016-08-31
The human brain is the most advanced compact processing unit we know of, but improvements in image processing may see machines soon surpass us, thanks to new processors, tools, architectures, and software. These new technologies and their rapid rate of adoption hold enormous potential for industrial manufacturing and inspection, as well as medicine, consumer electronics/gaming, and of course, robotics.
Even now, we humans effortlessly perform control equivalent to proportional, integral, and derivative (PID) functions when doing simple tasks like filling a pitcher of water. In fact, our motion control and balancing are so refined that robots may soon be envying us. What has truly set us apart from most machines, however, is our ability to pick out patterns in our visual field, recognize objects, perceive depth, track motion, and gauge relative speed and even acceleration.
Early image processing focused on clarifying still images, and many of the algorithms for edge enhancement and bringing out subtle detail were never applied to fast-frame-rate, high-resolution images. Processors were simpler, slower, and less able to perform DSP functions, so results were sketchy, slow to obtain, and nowhere near as reliable as the discerning human eye and brain.
However, as we evolve our machines, we are starting to give them capabilities that we can't match, and image and real-time video processing is one of the capabilities our future machines (overlords?) will have. It is thanks to the processors, tools, algorithms, and heuristics we are formulating now that we are able to give machines these capabilities, which may soon be far in advance of our own.
Memory and architecture matters
Video information is processed in the digital domain, so it doesn't matter whether the image comes from NTSC, PAL, RGB, YUV, component, or HD sources, interlaced or non-interlaced. Front-end sync and decoder chips and hardware stages do a fine job of capturing the image pixels and stuffing them into an array of memory, typically as rasterized scan lines.
The memory architecture and topology do matter, however, as they affect how a processor accesses and manipulates the data. For example, you can squeeze image data into memory in a linear fashion, but this may mean that adjacent frames will not line up at bit-addressable memory boundaries. It is faster and easier, for example, to have the first scan line of each frame align with a binary counter's zero point. Comparing successive frames becomes much easier if complex addressing schemes are not needed to index into and examine the same part of the visual field from frame to frame.
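To make the indexing concrete, here is a minimal C sketch of the idea (the function and buffer names are illustrative, not from any particular library): each scan line is padded out to a power-of-two stride so that the same pixel position lands at the same offset in every frame, and frame-to-frame comparison reduces to shifts and fixed offsets.

```c
#include <stdint.h>
#include <stdlib.h>

/* Round the scan-line stride up to the next power of two so that
 * (frame, row, column) indexing reduces to shifts and adds, and the
 * same pixel position lands at the same offset in every frame. */
static size_t pow2_stride(size_t width_bytes)
{
    size_t stride = 1;
    while (stride < width_bytes)
        stride <<= 1;
    return stride;
}

/* Index into frame f, row y, column x of a padded frame buffer. */
static inline uint8_t *pixel_at(uint8_t *buf, size_t frame_bytes,
                                size_t stride, int f, int y, int x)
{
    return buf + (size_t)f * frame_bytes + (size_t)y * stride + x;
}

int main(void)
{
    const int width = 720, height = 480, frames = 2;
    size_t stride = pow2_stride(width);          /* 720 -> 1024 */
    size_t frame_bytes = stride * height;
    uint8_t *buf = calloc(frames, frame_bytes);
    if (!buf) return 1;

    /* Comparing the same pixel across successive frames is now a
     * single fixed-offset lookup in each frame. */
    int diff = *pixel_at(buf, frame_bytes, stride, 1, 100, 200) -
               *pixel_at(buf, frame_bytes, stride, 0, 100, 200);
    (void)diff;
    free(buf);
    return 0;
}
```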
Another memory-related factor that can simplify video processing and information extraction has to do with bit-plane separation, especially as pixel resolutions and palette depths increase. When this happens, the ability to look at monochrome images in a most significant bit (MSB) to least significant bit (LSB) fashion allows boundaries and edges to stand out (Figure 1). The largest contrasts occur when looking at only the MSB-rendered images. Some depth and gradation information can be extracted looking at successively less-significant bit-rendered images.
Figure 1: Bit-plane separation allows edges to stand out more clearly as the most significant bits (MSBs) show more contrast than the least significant bits (LSBs). This means that parallel video data needs to be decimated and bit magnitudes separated into individual monochrome memory planes. (Source: DigiKey)
This means that image data must be separated in real-time into bit-oriented efficient blocks of memory that a processor can index into and examine quickly. Here is where hardware architecture comes into play.
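As an illustration of that separation step, the following C sketch (a hypothetical helper function, not vendor code) splits an 8-bit grayscale frame into eight packed monochrome bit planes, with the MSB plane first since it carries the strongest contrast.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Split an 8-bit grayscale frame into eight packed 1-bit planes
 * (8 pixels per output byte). planes[0] receives the MSB plane,
 * where edges stand out most clearly; planes[7] the LSB plane. */
void split_bit_planes(const uint8_t *frame, size_t n_pixels,
                      uint8_t *planes[8])
{
    for (int b = 0; b < 8; b++)
        memset(planes[b], 0, (n_pixels + 7) / 8);

    for (size_t i = 0; i < n_pixels; i++) {
        uint8_t px = frame[i];
        for (int b = 0; b < 8; b++) {
            /* Bit 7 of the pixel goes to plane 0 (the MSB plane). */
            uint8_t bit = (px >> (7 - b)) & 1u;
            planes[b][i >> 3] |= (uint8_t)(bit << (i & 7u));
        }
    }
}
```

Each plane is then a compact block of memory that a processor, or a dedicated hardware stage, can scan independently.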
Parallel versus pipelined
Two major architectural approaches can be used to accomplish real-time video processing tasks. The first is to use multiple processors and assign each one a part of the problem, either in parallel, in series, or both. For example, an array of 24 processors that each examine a bit-plane separated memory block can discern edges and boundaries 24 times faster than a single processor that has to examine each block one at a time. Task-executive processors delegate tasks to each individual processor and examine condensed results in partially “digested” form. This allows next-stage decision-making processors to act on the data in a much more responsive way.
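A rough sketch of this executive/worker split follows, using POSIX threads as a stand-in for physically separate processors; the per-plane edge metric is deliberately crude and purely illustrative.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define N_PLANES 8

struct plane_job {
    const uint8_t *plane;   /* packed 1-bit plane, 8 pixels per byte */
    size_t n_bytes;
    size_t transitions;     /* condensed result handed back upstream */
};

/* Worker: scan one bit plane and count byte-to-byte changes, a crude
 * stand-in for per-plane edge detection. */
static void *scan_plane(void *arg)
{
    struct plane_job *job = arg;
    job->transitions = 0;
    for (size_t i = 0; i + 1 < job->n_bytes; i++)
        if (job->plane[i] != job->plane[i + 1])
            job->transitions++;
    return NULL;
}

/* Executive: launch one worker per plane and gather the "digested"
 * results; a decision stage then acts on N_PLANES summary numbers
 * instead of raw pixels. */
void scan_all_planes(struct plane_job jobs[N_PLANES])
{
    pthread_t tid[N_PLANES];
    for (int p = 0; p < N_PLANES; p++)
        pthread_create(&tid[p], NULL, scan_plane, &jobs[p]);
    for (int p = 0; p < N_PLANES; p++)
        pthread_join(tid[p], NULL);
}
```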
This is especially important when it comes to stereoscopic vision processing, where depth perception becomes realizable. Bit-plane-separated images may show edges, but it is the location of the same edges in different parts of each camera's memory block that is used to determine depth via a digital triangulation algorithm. As these edges move from frame to frame between the two cameras' memory blocks, the processor can extract distance, speed, and direction data more quickly than if it were trying to do so using parallel color pixels frame by frame (Figure 2).
Individual processors monitoring the individual bit planes can accommodate shading, light sources, and even discern partially obscured edges of a tracked object as it moves in front of and behind other moving or stationary objects in the field of view.
Figure 2: Stereoscopic imaging can leverage the contrast brought out by bit-plane separation by examining the location in the memory array of the detected edges, as well as the distortion of each relative image. These can be used to ascertain depth as well as relative motion as an object moves in successive frames. (Source: DigiKey)
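The underlying triangulation is standard pinhole-camera geometry: depth is the focal length times the camera baseline, divided by the pixel disparity of the same edge between the two views. A minimal C sketch:

```c
#include <math.h>

/* Classic pinhole-camera triangulation: an edge seen at column x_left
 * in the left image and x_right in the right image has disparity
 * d = x_left - x_right (in pixels). With focal length f (in pixels)
 * and camera baseline B (in meters), depth follows as Z = f * B / d. */
double depth_from_disparity(double x_left, double x_right,
                            double focal_px, double baseline_m)
{
    double disparity = x_left - x_right;
    if (fabs(disparity) < 1e-6)
        return INFINITY;        /* object effectively at infinity */
    return focal_px * baseline_m / disparity;
}
```

Tracking how the disparity of the same edge changes across successive frames then yields closing speed, and the change in that speed yields acceleration.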
It is also important to note that decimated pixel and palette resolutions can be used. Not every bit of a 24- or 36-bit A/D-converted color pixel is needed to extract the desired information. Ambient light levels and coloring can be used as part of the algorithm to determine the optimal decimation of resolutions.
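A hedged sketch of such decimation is shown below, where the number of retained bits per channel is chosen from a coarse ambient-light estimate; the threshold is an illustrative guess, not a calibrated value.

```c
#include <stdint.h>

/* Decimate a 24-bit 0xRRGGBB pixel down to 'bits' significant bits
 * per channel by masking off the low-order bits, which in dim scenes
 * are mostly sensor noise. */
uint32_t decimate_rgb(uint32_t rgb, unsigned bits)
{
    uint32_t mask = (0xFFu << (8 - bits)) & 0xFFu;
    return (rgb & (mask << 16)) | (rgb & (mask << 8)) | (rgb & mask);
}

/* Pick a palette depth from a coarse ambient-light estimate (0..255);
 * the 128 threshold is illustrative only. */
unsigned choose_depth(uint8_t ambient)
{
    return ambient > 128 ? 5u : 3u;   /* keep 5 MSBs in bright light */
}
```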
Simple vs. complex architectures
The term "processor" covers a lot of ground here. It can mean a rather high-horsepower, fast-clocking CISC DSP, a hardware-based functional RISC block, or anything in between. This is where design engineers face another architectural choice.
Do we use multiprocessors or multi-core processors that have complex architectures and signal-processing features, or do we use dedicated logic blocks implemented in discrete hardware like FPGA structures? Both are valid approaches, and both carry with them features and benefits. Also, they are not mutually exclusive: they can be used together to create an optimal solution for the tasks at hand.
For example, vision systems used for quality-control inspection can use snapshot-like inspection algorithms that look at specific features such as weld continuity, solder-pad integrity, circuit-board trace consistency, and structural defects. This can occur as boards move past a stationary field of view, or as cameras move around on a robotic mechanism.
A multi-core processor family like the NXP i.MX 6DualPlus and i.MX 6QuadPlus applications processors is designed for graphics-intensive applications. The family features 1.2-GHz ARM Cortex-A9 processors with advanced memory management and interfaces for 64-bit DDR3 or dual-channel, 32-bit LPDDR2 memory, integrated with internal L2 cache RAM.
A key feature in this context is 3D graphics acceleration with quad 720-MHz shaders that can deliver high-performance graphics using dedicated hardware. Specific parts like the quad-core 32-bit MCIMX6QP5EYM1AA support – in hardware – dual displays of up to 1920 x 1200 (HD) resolution, meaning that frame memory and refresh are intrinsically supported for image acquisition and display (very useful for debugging and testing out algorithms), as well as image processing.
Hardware resizing and de-interlacing simplify video data formatting and indexing, freeing coding resources for image data processing and overlaying onto display data. Examples of this overlaying include resized images that restrict processing to specific parts of a data frame to decrease processing time. Another example is the use of shaders, which not only can make an image stand out more to the human eye, but can also bring out detail in an image to be processed.
Example code and hardware designs are also a plus with the NXP MCIMX6QP-SDB development platform, which includes complete hardware design files and direct support for the Android, Linux, and FreeRTOS operating environments. To simplify interfacing, HDMI and LVDS connections and support for two 1 Gbps MIPI DSI data lanes are on board, as are MIPI CSI camera ports (Figure 3).
Figure 3. The multi-core processors of the NXP MCIMX6QP-SDB development platform can render and process image data in real time as demonstrated by the development environment tool kits and demonstration software. (Source: NXP)
By themselves, these processors can be coded to examine the conical shape of many solder joints in a single high-resolution stereoscopic snapshot frame, for example, as needed for applications like pc-board quality-inspection video processing.
In addition, multiple processors can be cascaded to expand capabilities. However, when software is still too slow, dedicated hardware, by itself or in combination with these high-end processors, can push performance to the highest levels.
Hardware-based algorithms
Pipelined hardware logic sequencers will always outperform software running on general-purpose processors. It may take a few cycles for the pipeline to fill, but once filled, every clock cycle delivers processed and digested information, rather than the multiple clock cycles a coded approach may need for the same level of delivery.
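A toy software model makes this fill-then-stream behavior concrete. Here a three-stage "pipeline" pays a three-cycle fill latency and then retires one result per simulated clock; the per-stage work is trivial placeholder arithmetic, not a real video algorithm.

```c
#include <stdio.h>

#define STAGES 3

/* Toy model of a 3-stage hardware pipeline: after STAGES cycles of
 * fill latency, one finished result retires on every "clock". */
int main(void)
{
    int input[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int reg[STAGES] = {0};
    int valid[STAGES] = {0};

    for (int clk = 0; clk < 8 + STAGES; clk++) {
        /* Retire the last stage first so data moves like registers. */
        if (valid[STAGES - 1])
            printf("clk %2d: result %d\n", clk, reg[STAGES - 1]);
        for (int s = STAGES - 1; s > 0; s--) {
            reg[s] = reg[s - 1] + 1;   /* each stage does trivial work */
            valid[s] = valid[s - 1];
        }
        if (clk < 8) { reg[0] = input[clk] * 2; valid[0] = 1; }
        else valid[0] = 0;
    }
    return 0;
}
```

The first result appears at clock 3; after that, a result retires every clock, which is exactly the throughput advantage the hardware pipeline offers.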
While inspection machines may not need the blazing speed of hardware-based acceleration, many applications will. A field artillery robot fighting an enemy field artillery robot will only win if its target acquisition, motion control, and accuracy are superior. The fastest and most streamlined hardware will survive.
Modern-day logic arrays have the density, speed, and depth, as well as the memory and peripheral interfaces, to accommodate very complex tasks, including real-time video. The multi-gate-level architectures of massive logic arrays have for the most part been replaced by look-up-table (LUT) logic elements that can output each logic block's result in a single synchronous clock cycle. Keep in mind that modern logic arrays can clock in the GHz range.
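To see why LUT elements are so fast, note that a 4-input logic element is nothing more than a 16-entry truth table addressed by its inputs. This C model (illustrative only, not vendor tooling) "programs" one as a 4-input XOR, a common building block of parity and edge-detect logic.

```c
#include <stdint.h>

/* Truth table for 4-input XOR: a bit is set wherever the 4-bit
 * address has odd parity. */
#define LUT4_XOR 0x6996u

/* A 4-input LUT: the four inputs form an address into a 16-bit
 * constant, and the selected bit is the output. In an FPGA this
 * lookup completes in a single synchronous clock. */
static inline unsigned lut4(uint16_t truth_table, unsigned a,
                            unsigned b, unsigned c, unsigned d)
{
    unsigned addr = (a & 1u) | ((b & 1u) << 1) |
                    ((c & 1u) << 2) | ((d & 1u) << 3);
    return (truth_table >> addr) & 1u;
}
```

Any 4-input Boolean function, not just XOR, is implemented the same way by loading a different 16-bit constant, which is what "uploading a logic code pattern" ultimately configures.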
To develop your video-processing algorithms with FPGAs, you will want to use a platform that implements the target parts, handles uploading of logic code patterns, provides access to I/Os, has clean clock sources, and includes a good mix of on-chip hard macros for specific needs such as phase-locked loops (PLLs) and high-speed memory interfaces. Once you have identified a manufacturer and part family, the development platforms will fall into place.
As an example, take National Instruments' Digilent Nexys Video FPGA board, featuring the Xilinx high-density, high-performance Artix-7 family of FPGAs (Figure 4). These range from 1300 to 16,825 logic cells and from 45 to 740 DSP slices. A key feature is dual-port RAM speeds of up to 509 MHz, allowing concurrent access and modification by independent stages of logic.
Figure 4. Developing pipelined video algorithms with FPGAs is greatly simplified by a development platform such as the Digilent Nexys Video FPGA board from National Instruments, which includes video-processing inputs and outputs feeding a high-performance, high-density FPGA (in this case from Xilinx). (Source: Digilent, a National Instruments company)
The National Instruments board uses a low-power 1 V core and features quad-SPI flash as well as DDR3 memory interfaces for direct access to external pools of non-volatile and volatile memory, respectively. Both HDMI and DVI interfaces are also included, using transition-minimized differential signaling (TMDS) standards.
While development platforms are a good way to evaluate and design in a relatively rapid timeframe with relatively low risk, eventually you will want to spin your own boards and use your own test-and-development platform. Fortunately, the Xilinx Vivado tool suite is supported under the free WebPACK license, which includes the ability to create MicroBlaze soft-core processors that can be used inside the FPGA as process engines to direct and manage the programmability of your logic-based pipelined architecture. Design resources, example projects, data sheets, and tutorials can be downloaded from the Nexys site.
Conclusion
With low-power, high-performance semiconductors and the right tools, architectures, and algorithms, machines will quickly surpass humans in areas where we once thought we were far superior, such as pattern recognition, object recognition, depth perception, motion tracking, and the ability to gauge relative speed and even acceleration.
Not only will this enable new applications in industrial control and medicine, but also in consumer electronics, gaming, and of course, robotics. In the case of the latter, we may soon have to re-evaluate our own performance relative to machines. Interesting days ahead!
Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not reflect the opinions, beliefs, and viewpoints of DigiKey or official policies of DigiKey.