The processor architecture consists of adders, multipliers, and shifters interconnected to support the computational structure of the specific filter. Figure 5.14 describes the processor architectures for computation of the lifting steps. All the lifting steps for DWT and IDWT are essentially of the form $y_i = a(x_{i-1} + x_{i+1}) + x_i$, where $a$ is a constant multiplication factor. For the (5, 3) filter, the multiplication factors in both lifting stages are powers of 2, and hence the multiplications can be executed by simple shift operations. As a result, the processor for computation of the (5, 3) filter consists of two adders and a shifter, whereas the processor for computation of the (9, 7) filter consists of two adders and a multiplier.
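As a minimal sketch of the lifting step $y_i = a(x_{i-1} + x_{i+1}) + x_i$, the Python fragment below compares a general multiplier-based step with shift-based versions for the (5, 3) filter. The function names are hypothetical. The first-stage factor a = -1/2 follows the schedule discussion in this section; the second-stage factor a = 1/4 is assumed from the standard (5, 3) lifting factorization. Any rounding offsets used by a particular standard are omitted.

```python
def lifting_step(x_prev, x_cur, x_next, a):
    """General lifting step: y_i = a * (x_{i-1} + x_{i+1}) + x_i (needs a multiplier)."""
    return a * (x_prev + x_next) + x_cur


def lifting_step_53_first(x_prev, x_cur, x_next):
    """(5, 3) first stage, a = -1/2: two additions and one right shift."""
    # Note: >> floors the result, so this equals the exact value only
    # when the neighbour sum is even.
    return x_cur - ((x_prev + x_next) >> 1)


def lifting_step_53_second(x_prev, x_cur, x_next):
    """(5, 3) second stage, a = 1/4 (assumed): two additions and a shift by two."""
    return x_cur + ((x_prev + x_next) >> 2)
```

With integer inputs whose neighbour sum is divisible by the shift amount, the shift-based versions agree with the multiplier-based step; otherwise they differ by the truncation introduced by the shift.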
Figure 5.15 describes part of the schedule for the (5, 3) wavelet filter to transform a row (that is, in one dimension). The schedules are generated by mapping the dependency graph onto the resource-constrained architecture. It is assumed that the delays of each adder, shifter, and multiplier are 1, 1, and 4 time units, respectively. For example, Adder1 of P1 adds the elements (x0, x2) in the second cycle and stores the sum in register RA1. The shifter reads this sum in the next cycle (third cycle), carries out the required number of shifts (one right shift, as a = -0.5), and stores the data in register RS. The second adder (Adder2) reads the value in RS and subtracts the element x1 to generate y1 in the next cycle. To process N = 9 data, the P1 processor takes four cycles. Adder1 in the P2 processor starts computation in the sixth cycle. The gaps in the schedules for P1 and P2 are required to store the zeroth element of each row.
VLSI ARCHITECTURES FOR LIFTING-BASED DWT
Fig. 5.13 Processor assignment for the (5, 3) wavelet filter.
Fig. 5.14 Processor architecture for the (5, 3) and (9, 7) filters.
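The P1 portion of the schedule in Figure 5.15 can be reproduced with a short sketch. It assumes a three-stage pipeline (Adder1, shifter, Adder2) with one cycle per stage and a new input pair issued every cycle starting in cycle 2. The function name `p1_schedule` and the string labels are hypothetical and chosen only to mirror the figure.

```python
def p1_schedule(n, start_cycle=2):
    """Return {cycle: [adder1_op, shifter_op, adder2_op]} for processor P1.

    Assumes unit delay for each adder and for the shifter, and one new
    pair of even-indexed inputs entering Adder1 per cycle.
    """
    num_outputs = (n - 1) // 2           # odd-indexed outputs y1, y3, ...
    schedule = {}
    for k in range(num_outputs):
        i = 2 * k + 1                    # index of the output y_i
        ops = (
            f"x{i - 1} + x{i + 1}",      # Adder1: sum stored in RA1
            "RA1",                       # Shifter: reads RA1, writes RS
            f"RS - x{i} = y{i}",         # Adder2: produces y_i
        )
        for stage, op in enumerate(ops):
            cycle = start_cycle + k + stage
            schedule.setdefault(cycle, ["-", "-", "-"])[stage] = op
    return schedule


if __name__ == "__main__":
    for cycle, row in sorted(p1_schedule(9).items()):
        print(cycle, *row, sep=" | ")
```

For N = 9 this issues four additions in cycles 2 to 5, matching the statement that P1 takes four cycles to accept the data, and the last output y7 leaves Adder2 in cycle 7.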
5.3.8 A Generalized Two-Dimensional Architecture
Generally, two-dimensional wavelet filters are separable functions. A straightforward approach for two-dimensional implementation is to first apply the one-dimensional DWT row-wise (to produce the L and H subbands) and then column-wise to produce the four subbands LL, LH, HL, and HH in each level of decomposition, as shown in Figure 4.3(a) in Chapter 4. Processor utilization is a concern in a direct implementation of this approach because all the rows in the image must be processed before the column-wise processing can begin. As a result, it requires a memory buffer of the order of the image size and hence increases the total computation delay. The alternative approach to reduce these inefficiencies is to begin the column processing as soon as a sufficient number of rows have been filtered. The column-wise processing is then performed on these available lines to produce wavelet coefficients row-wise. A similar approach can be adopted for implementation of the two-dimensional lifting scheme as well.
VLSI ARCHITECTURES FOR DISCRETE WAVELET TRANSFORMS
Cycle | Processor 1                           | Processor 2
      | Adder1    Shifter  Adder2             | Adder1    Shifter  Adder2
1     | -         -        -                  | -         -        -
2     | x0 + x2   -        -                  | -         -        -
3     | x2 + x4   RA1      -                  | -         -        -
4     | x4 + x6   RA1      RS - x1 = y1       | -         -        -
5     | x6 + x8   RA1      RS - x3 = y3       | -         -        -
6     | -         RA1      RS - x5 = y5       | y1, y3    -        -
7     | -         -        RS - x7 = y7       | y3, y5    RA1      y0
8     | -         -        -                  | y5, y7    RA1      RS + x2
9     | -         -        -                  | -         RA1      RS + x4
10    | -         -        -                  | -         -        RS + x6
Fig. 5.15 Partial schedule for the (5, 3) filter implementation.
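The straightforward row-then-column approach can be sketched as follows. This is an illustrative sketch under several assumptions, not the architecture's actual data path: the function names are hypothetical, symmetric extension is assumed at the boundaries, the second lifting factor is taken as 1/4, and floating-point division stands in for the hardware shifts. The subband naming follows the convention that the first letter refers to the row filter and the second to the column filter.

```python
def dwt53_1d(x):
    """One level of (5, 3) lifting on a sequence; returns (low, high)."""
    def at(seq, i):
        # Symmetric boundary extension (an assumption of this sketch).
        if i < 0:
            i = -i
        if i >= len(seq):
            i = 2 * (len(seq) - 1) - i
        return seq[i]

    y = list(x)
    for i in range(1, len(x), 2):        # first lifting step, a = -1/2
        y[i] = x[i] - (at(x, i - 1) + at(x, i + 1)) / 2
    for i in range(0, len(x), 2):        # second lifting step, a = 1/4
        y[i] = x[i] + (at(y, i - 1) + at(y, i + 1)) / 4
    return y[0::2], y[1::2]


def transform_columns(block):
    """Apply dwt53_1d to every column; returns (low, high) as row-major lists."""
    cols = [dwt53_1d(list(c)) for c in zip(*block)]
    low = [list(r) for r in zip(*(c[0] for c in cols))]
    high = [list(r) for r in zip(*(c[1] for c in cols))]
    return low, high


def dwt53_2d(image):
    """Row-wise pass over the whole image, then column-wise pass."""
    L, H = [], []
    for row in image:                    # every row is filtered first ...
        lo, hi = dwt53_1d(row)
        L.append(lo)
        H.append(hi)
    LL, LH = transform_columns(L)        # ... before any column is touched
    HL, HH = transform_columns(H)
    return LL, LH, HL, HH
```

The intermediate lists `L` and `H` together hold as many samples as the input image, which is the image-sized buffer the text refers to. A line-based schedule would instead start the column step once the few rows needed by the filter support are available.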
The two-dimensional architecture proposed in  computes both the forward and inverse lifting-based DWT in the traditional row-column fashion. However, the data are scheduled so that column processing can start as soon as enough data is available after row-wise processing, as explained earlier, in order to minimize the computation delay. As shown in Figure 5.16, the architecture consists of a row module, a column module, and two memory modules (MEM1, MEM2). The row module consists of two processors, RP1 and RP2, along with a register file REG1, which stores the intermediate data between the two lifting steps computed by RP1 and RP2. Similarly, the column module consists of two processors, CP1 and CP2, along with a register file REG2, which stores the intermediate data between the two lifting steps computed by CP1 and CP2. The register files are placed between the processors mainly to hold these intermediate results locally, so that they need not be written to memory and read back. The internal logic of all four processors RP1, RP2, CP1, and CP2 is the same, as shown in Figure 5.14.
Fig. 5.16 Block diagram of the two-dimensional architecture.
When the DWT requires two lifting steps (as in the (5, 3) wavelet filter), processors RP1 and RP2 read the data from MEM1, perform the computation along the rows, and write the data into MEM2. We denote this mode of operation of the architecture as the 2M architecture mode. Processor CP1 reads the data from MEM2, performs the column-wise DWT along alternate rows, and writes the HH and LH subbands into MEM2 and an external memory (Ext.MEM). Processor CP2 reads the data from MEM2, performs the column-wise DWT along the rows that CP1 did not work on, and writes the LL subband to MEM1 and the HL subband to Ext.MEM. The data flow is shown in Figure 5.17(a).