The Making of a Cooled CMOS Camera
Now this post will be for some serious stuff involving video compression. Early this year I decided to make a lossless compression IP core for my camera in case one day I make it for video. And because it’s for video, the compression has to be stream operable and real time. That means, you cannot save it to DDR ram and do random lookup during compression. JPEG needs to at least buffer 8 rows as it does compression on 8×8 blocks. Other complex algorithm such as H264 requires even larger transient memory for inter frame look up. Most of these lossy compression cores consume a lot of logic resource which my cheap Zynq 7010 doesn’t have, or not up to the performance when fitting into a small device. Also I would prefer lossless than lossy video stream.
There’s an algorithm every RAW image format uses but rarely implemented in common image format. NEF, CR2, DNG, you name it. It’s the Lossless JPEG defined in 1993. The process is very simple: use the neighboring pixels’ intensity to predict the current pixel you’d like to encode. In another word, let’s record the difference instead of the full intensity. It’s so simple yet powerful (up to 50% size reduction) because most of our images are continuous tone or lack high spatial frequency details. This method is called Differential pulse-code modulation (DPCM). A Huffman code is then attached in front to record the number of digits.
Sounds easy huh? But once I decided to get it parallel and high speed, the implementation will be very hard. All the later bits have to be shifted correctly for a contiguous bit stream. Timing is especially of concern when the number of potential bits gets large when data is running in high parallel. So I smash the process into 8 pipeline stages in locked steps. 6 pixels are processed simultaneously at each clock cycle. At 12 bit, the worst possible bit length will be 144. That is 12 for Huffman and 12 for differential code each. The result needs to go into a 64bit bus by concatenating with the leftover bits from the previous clock cycle. A double buffer is inserted between the concatenator and compressor. FIFOs are employed up and downstream of the compression core to relieve pressure on the AXI data bus.
By optimizing some control RTL, the core is happily confined at 200MHz now. Thus theoretically, it could easily process at a insane rate of 1.2GPixel/sec, although this sensor would not stably do it with four LVDS banks. When properly modified, it could suit other sensor which does block based parallel read out. For resource usage, a lot of the LUTs cannot be placed in the same slice as Flip-Flops. Splitting the bit shifter into multiple pipeline stages would definitely reduced the LUT usage and improve timing. But generally the FFs will shoots up to match the number of LUT thus the overall slice usage will probably be identical.
During the test, I used the Zynq core to setup the Huffman look up table. The tree can be modified between the frames so the optimal compression will be realized based on the temporal scene using a statistics block I built in.
Now I just verified the bit stream to decompress correctly using DNG/dcraw/libraw converter. The only addition is a file header and bunch of zeros following 0xFF in compliance with JPEG stream format.
Read more: The Making of a Cooled CMOS Camera
Leave a Comment
You must be logged in to post a comment.