Download (direct link):
The convolution-based implementation of discrete wavelet transform has been studied extensively in the literature. There are many hardware architectures for VLSI implementation of the convolution-based DWT reported in the literature in last two decades [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Development of VLSI architectures for lifting-based DWT is relatively a new area of study in both academia and industry. We describe some of the recent such architectures reported in the literature in the following sections.
5.3 VLSI ARCHITECTURES FOR LIFTING-BASED DWT
We have presented the underlying principles and formulation of the lifting-based discrete wavelet transform in Chapter 4. In this section, we present how the lifting operations can be mapped into hardware architectures and their possible VLSI implementations.
Input Xq Xi x2 X3 x$ Xq Х-j x$
Fig. 5.6 Data dependency diagram of lifting-based DWT with four lifting factors.
The data dependency of the lifting scheme can be explained via a dataflow graph as shown by a block diagram in Figure 5.6. For the DWT filters which can be decomposed into four lifting factors (i.e., four lifting steps), the computation is done in four stages. These four stages are depicted in Figure 5.6. The values of a, b, c, d, and K depend on the selection of the DWT filters. Once the DWT filters are chosen, however, they are constant throughout the processing. The intermediate results generated in the first two stages for the first two lifting steps are stored temporarily and these
VLSI ARCHITECTURES FOR LIFTING-BASED DWT
intermediate results are subsequently processed to produce the high-pass (HP) outputs in the third stage followed by the low-pass (LP) outputs in the final stage. The popular (9, 7) filter is an example of a DWT filter that requires four lifting steps with a = -1.586134342, b = -0.05298011854, c = 0.8829110762, d = —0.4435068522, and K = 1.149604398. For the DWT filters requiring only two factors such as the (5, 3) filter proposed in JPEG2000 standard, the intermediate two stages can be simply bypassed.
Several architectures have been proposed in the literature for implementation of the lifting steps in VLSI. These architectures range from highly parallel architectures to programmable DSP-based architectures to folded architectures. We present systematic derivations of some of these architectures in the following sections.
5.3.1 Mapping the Data Dependency Diagram in Pipeline Architectures
A direct mapping of the data-flow diagram of the lifting steps into a pipelined architecture initially was proposed by Liu, Shiau, and Jou . Block diagram of this pipeline architecture is shown in Figure 5.7. Several variations and enhancements of this architecture were proposed later for improved performance and better hardware efficiency.
Fig. 5.7 Data dependency diagram mapped into a hardware architecture
The architecture shown in Figure 5.7 is designed with 8 adders (A1-A8), 4 multipliers (M1-M4), 6 delay elements (D), and 8 pipeline registers (R). There are two input lines in the architecture, one with all the even samples (x2i) and the other with all the odd samples (x2i+\)• There are four pipeline stages in the architecture. In the first pipeline stage, the output of adder A1 is x2i + x2i-2 and the output of A2 is a(x2i + x2i_2) + x2j_i, which represents the first intermediate results in the data dependency diagram for lifting as shown in Figure 5.6. In a similar fashion, the outputs of A4 in the second pipeline stage represent the second intermediate results. Continuing
VLSI ARCHITECTURES FOR DISCRETE WAVELET TRANSFORMS
in this fashion, A6 in the third pipeline stage produces the high-pass output samples, whereas A8 produces the low-pass output samples. The cost of hardware of this architecture is very high because of requiring 4 multipliers and 8 adders. Moreover, for the lifting schemes requiring only two lifting steps, such as the (5, 3) filters, the last two pipeline stages need to be bypassed, causing hardware utilization to be only 50% or less. Also for a single read port memory, the odd and even samples are read serially in alternate clock cycles and buffered. This slows down the overall pipelined architecture by 50% as well.
A similar pipeline architecture for VLSI implementation of (9, 7) wavelet filters was proposed by Jou, Shiau, and Liu in  based on a pipeline scheduling technique adopted by converting the behavioral description of the lifting into the corresponding data flow graph (DFG). The resulting datapath is similar to Figure 5.7.
To minimize the hardware inefficiencies described in the above pipeline architecture, several architectural enhancements and design methodologies have been proposed in the literature. We describe some of them in the following sections.
5.3.2 Enhanced Pipeline Architecture by Folding
Hardware utilization of the pipeline architecture described in Figure 5.7 is 50% or less for the wavelet filters with only two lifting factors (e.g., (5, 3) filter). The utilization of the pipeline architecture can be further improved by carefully folding the last two pipeline stages into the first two pipeline stages, as shown in Figure 5.8.