Development of a power efficient image coding algorithm based on integer wavelet transform

K. Masselos, Y. Karayiannis, I. Andreopoulos, T. Stouraitis
VLSI Design Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece
E-mail:

ABSTRACT

A novel algorithm for very high compression of grayscale images, with features that lead to power efficient implementations, is proposed. The algorithm was conceived using a simple methodology based on a hierarchical three-stage exploration of the algorithmic design space. It is based on an integer wavelet transform, which is much more efficient in terms of data storage and transfer than the widely used real wavelet transforms. The coefficients of the wavelet transform are coded with fractal techniques that use small, computationally generated codebooks. The performance of the proposed algorithm is comparable to or better than that of existing standard algorithms. State-of-the-art high-level power estimation techniques indicate that the proposed algorithm consumes several times less power than existing standard algorithms.

1. INTRODUCTION

Image and video coding has rapidly become an integral part of information exchange. Portability, packaging and cooling issues, as well as environmental concerns, have made power consumption an important design consideration [1]. This is especially true for image and video processing systems, whose high complexity leads to high power budgets. Optimization for power can be performed at all levels of the design hierarchy; however, the most significant power savings can be achieved in the earlier stages of the design trajectory. Most of the existing approaches at this level are based on specification refinement [1], i.e.
they start from an initial high level description of a given algorithm that meets some application determined constraints and try to generate a specification with improved power consumption characteristics. The improved application description is then passed to the more implementation specific levels of the design flow [3],[4]. At the higher algorithmic level, it has been demonstrated that selecting a suitable algorithm that meets the application-determined requirements may lead to important power gains [5], even larger than those that can be achieved at all the remaining levels of the design hierarchy. Specifically, [5] deals with image compression and proposes the use of tree-search vector quantization instead of full-search vector quantization in order to reduce the computational complexity. An attempt to formulate the problem of algorithm selection with respect to power consumption is described in [6]. However, this approach cannot be used for the development of new algorithms with low power characteristics. From the work described in [6] it becomes clear that the algorithm selection/design task has a larger impact on power than specification refinement techniques.

Design guidance at the algorithmic level based on several algorithmic properties is offered in [9]. The design of algorithms for image and video coding with low power characteristics is described in [6]-[9]. The algorithms described in [6],[8] adopt a wavelet-based approach combined with pyramid vector quantization for image coding [8] and decoding [6]. The main idea is the use of pyramid vector quantization for the encoding of the high frequency wavelet coefficients.
This leads to reduced power consumption due to the low memory requirements of pyramid vector quantization, since memory related power consumption forms the dominant part of the total power budget of image and video processing systems [1]. In all the above-described cases no systematic methodology is adopted and the new algorithms are developed on an ad-hoc basis.

Wavelet based still image coding methods are frequently claimed to be superior to block DCT based ones such as JPEG, today's still image coding standard [12], and they will be included in JPEG2000, the new standard for still image coding.

In this paper a novel power efficient algorithm for image coding is proposed. The algorithm is based on an integer wavelet transform and on fractal techniques and leads to better performance and lower power consumption than existing state-of-the-art algorithms. The algorithm has been developed using a simple methodology for the conception of power efficient algorithms.

The rest of the paper is organized as follows. Section 2 discusses the high-level power models used by the proposed approach. Section 3 describes the methodology adopted for the power conscious conception of the proposed algorithm. The proposed algorithm for very high image compression is described in section 4. Experimental results are presented in section 5 and conclusions are offered in section 6.

2. POWER MODELS - COST FUNCTION

The evaluation of the target algorithms in terms of power is based on behavioral or algorithmic descriptions. These descriptions do not imply a specific implementation architecture and thus do not allow for an estimation of the architecture-dependent switching effects. Only some high level architectural assumptions related to the storage of the algorithms' data structures are made.
These assumptions take into consideration possible area constraints for custom hardware implementations and restrictions in the memory and interconnect organization for the case of programmable predefined processors. Another assumption is that different data structures are stored in different memory blocks and not in different spaces of the same memory block. Power efficient assignment of different data structures to different spaces of the same memory block can be achieved by system level synthesis techniques in later stages of the design flow.

However, the behavioral descriptions allow for a detailed analysis of the algorithm-inherent activities [10], such as the switching that occurs in any design implementation, independent of the chosen implementation style. This estimation procedure can provide a relative evaluation of various algorithms, but not an accurate estimate of their power consumption, which is impossible to obtain at this high level of abstraction. Nevertheless, this evaluation is very important, since it can drive decisions that have to be made during the early stages of the design process and that drastically affect the final quality of the design. It must be noted that, because of the limited accuracy of any estimation performed at the algorithmic level, decisions must be based only on significant differences in such estimates.

The cost function driving the proposed methodology for power efficient conception of image and video processing algorithms is the sum of the power consumed in the main functional unit types of the final implementation:

$$\text{Total Power} = \sum_{k} P_{\text{off-chip memory}_k} + \sum_{l} P_{\text{on-chip memory}_l} + \sum_{m} P_{\text{line off-chip bus}_m} + \sum_{n} P_{\text{line on-chip bus}_n} + \sum_{i} P_{\text{arithmetic operation}_i}$$

The estimation of the power consumed by arithmetic operations is based on PowerPlay [11]. For the estimation of the power consumed in memories and in interconnect buses the models presented in [1] are used.

3.
METHODOLOGY FOR POWER EFFICIENT ALGORITHM DEVELOPMENT - LOW POWER ISSUES

The outline of the methodology that has been adopted for the power conscious conception of the proposed algorithm is shown in figure 1. In the first step, existing algorithms satisfying the application determined algorithmic performance requirements are evaluated in terms of power. The evaluation is based on profiling of the algorithms and uses the power models described in section 2. Some architecture related assumptions are made at this point, mainly concerning the storage of the data structures, taking into consideration area or other architectural constraints. After the existing algorithms have been analyzed, their major power bottlenecks and the related tasks are identified. More efficient solutions for these tasks must be developed without affecting the algorithmic performance.

In the following steps the power-related bottlenecks are tackled. This is performed in three successive steps, in order to reduce the complexity of exploring the huge algorithmic design space, as shown in figure 2.

In the first step the storage-related overheads are handled. Each task causing such overheads is first analyzed. Then a new task is designed, starting from existing relevant algorithmic kernels (and possible combinations of them). The main principles that can be followed in this case are:
a) Direct reduction of the storage requirements at both word and bit level. This reduces power directly, because the power consumption of a memory depends on its size. Power also benefits in a less direct way: the smaller size of a data structure may allow it to be stored on-chip and closer to the processing units, increasing locality.
b) Replacement of data storage by less power hungry data path operations.
After a new task has been developed, simulation is required to confirm that the desired algorithmic performance is achieved. Power is also evaluated to confirm the positive effect of the step.
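As a hedged illustration of how the power effect of such a step might be checked, the following Python sketch sums per-component power estimates in the manner of the cost function of section 2 and compares two versions of a task. All names and numbers are hypothetical and are not taken from the paper. The threshold `min_ratio` is likewise an assumption, reflecting only the remark that decisions should rest on significant differences between estimates.

```python
# Hypothetical sketch of the section 2 cost function: total power is the
# sum of the power consumed in off-chip memories, on-chip memories,
# off-chip bus lines, on-chip bus lines and arithmetic operations.

COMPONENT_TYPES = (
    "off_chip_memory",
    "on_chip_memory",
    "off_chip_bus_line",
    "on_chip_bus_line",
    "arithmetic_operation",
)


def total_power(estimates):
    """Sum the per-unit power estimates (in mW) over all component types."""
    return sum(sum(estimates.get(kind, [])) for kind in COMPONENT_TYPES)


def is_significant_gain(before, after, min_ratio=1.5):
    """Return True if `before` consumes at least min_ratio times `after`.

    min_ratio is an assumed threshold, not a value from the paper.
    """
    return total_power(before) >= min_ratio * total_power(after)


# Made-up estimates for two versions of the same task.
baseline = {
    "off_chip_memory": [400.0, 200.0],
    "on_chip_memory": [50.0],
    "off_chip_bus_line": [100.0],
    "on_chip_bus_line": [10.0],
    "arithmetic_operation": [40.0],
}
candidate = {
    "off_chip_memory": [200.0],
    "on_chip_memory": [80.0],
    "off_chip_bus_line": [50.0],
    "on_chip_bus_line": [20.0],
    "arithmetic_operation": [50.0],
}
```

In this made-up case the candidate moves data on-chip and spends more on arithmetic, yet its total is lower, which is the kind of trade-off the storage step aims for.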
In the next step transfer related bottlenecks are solved. The main principles in this case are:
a) Direct reduction of the data transfers, since the power consumed in the interconnect buses depends on the number of data transfers through them.
b) Focusing on the processing of a small part of the data that can be stored closer to the processing units. In this way data are transferred through less power costly buses.
c) Reducing the bitwidth of the more frequently transferred data, since the power consumed in the buses depends on their width.

In the last step arithmetic computation related power bottlenecks are tackled. The main principles in this case are the reduction of the number of executed arithmetic operations (directly related to power consumption [1]), the replacement of power costly operations by less power expensive ones, and the reduction of the operands' bitwidths (directly related to power consumption as well).

The order in which the power related bottlenecks are tackled is justified as follows. Arithmetic computation related issues are left to the last step because they have the smallest impact on power. Interconnect issues are handled after the storage related issues because they depend on the storage decisions. The output of the proposed methodology is a new power efficient algorithm meeting the application determined algorithmic performance requirements. Some architectural constraints may also be passed to the lower, more implementation specific levels of the design flow.

4. THE PROPOSED ALGORITHM

In the first step of the algorithm the image information is transferred to the wavelet domain using a reversible integer 2-10 filter (the TT wavelet transform [14]). Let $x(0), x(1), \ldots$ be the input signal, and let $s(0), s(1), \ldots$ and $d(0), d(1), \ldots$ be the smooth and the detail outputs of the wavelet transform, respectively.
Then we have:

$$s(n) = \left\lfloor \frac{x(2n) + x(2n+1)}{2} \right\rfloor, \qquad d(n) = x(2n) - x(2n+1) + p(n)$$

where $p(n)$ is defined by:

$$p(n) = \left\lfloor \frac{3s(n-2) - 22s(n-1) + 22s(n+1) - 3s(n+2) + 32}{64} \right\rfloor$$

The portion of the sub-tree that lies in the coarser decomposition layer ($LL\_x(x,y)$) is called a range sub-tree root. In order to obtain the domain sub-tree roots that are needed to perform the fractal coding in the wavelet domain, a separate one-level wavelet decomposition is performed on the $LL\_x(x,y)$ matrix, and the domain sub-tree roots occur at the immediately finer layer, named $LL\_x^{1}(x,y)$. This way each range sub-tree has the same number of coefficients as a domain sub-tree. A simplified BTP-type coder [15] encodes the $LL\_x(x,y)$ set, while the rest of the image information is represented as the collection of range sub-trees and is coded in a manner similar to the one described in [16].

Having defined the derivation of the range and domain sub-trees, the fractal block coding method is performed. The shrinking operation consists of simple multiplications, while the offset factors are not stored in the produced bit-stream. Finally, the geometrical orientation is performed with 90 degree rotations of the coefficients in each sub-tree and the switching of the LH and HL subbands.

The application of the fractal coding method in the wavelet domain creates a coded bitstream resembling the one produced by the original fractal method. The offset terms are coded differently, through the compression of the low frequency wavelet coefficients. This is combined with a classification scheme that treats the geometrical and the luminance transformation in a separable manner.
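As a hedged illustration of the TT transform introduced at the start of this section, the following Python sketch applies one level of the transform to a one-dimensional integer signal and inverts it exactly. It uses s(n) = floor((x(2n) + x(2n+1)) / 2), d(n) = x(2n) - x(2n+1) + p(n) and p(n) = floor((3s(n-2) - 22s(n-1) + 22s(n+1) - 3s(n+2) + 32) / 64). The function names are hypothetical, and the boundary handling (clamping the index of s at the signal edges) is an assumption, since the paper does not state how edges are treated.

```python
def _predict(s, n):
    """p(n) computed from the smooth output; edge indices are clamped (assumption)."""
    last = len(s) - 1

    def at(i):
        return s[min(max(i, 0), last)]

    # Python's // floors toward minus infinity, matching the floor operator.
    return (3 * at(n - 2) - 22 * at(n - 1) + 22 * at(n + 1) - 3 * at(n + 2) + 32) // 64


def tt_forward(x):
    """One decomposition level on an even-length list of integers."""
    half = len(x) // 2
    s = [(x[2 * n] + x[2 * n + 1]) // 2 for n in range(half)]
    d = [x[2 * n] - x[2 * n + 1] + _predict(s, n) for n in range(half)]
    return s, d


def tt_inverse(s, d):
    """Exact integer reconstruction of the input of tt_forward."""
    x = []
    for n in range(len(s)):
        diff = d[n] - _predict(s, n)   # x(2n) - x(2n+1)
        total = 2 * s[n] + (diff & 1)  # x(2n) + x(2n+1); same parity as diff
        x.append((total + diff) // 2)
        x.append((total - diff) // 2)
    return x


samples = [10, 12, 200, 3, 0, 255, 7, 7]
smooth, detail = tt_forward(samples)
restored = tt_inverse(smooth, detail)
```

The smooth output is a floored average of two input samples, so for 8-bit input it stays within the 8-bit range, which is the "no growth" property the paper relies on for in-place storage.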
Beginning from the geometrical transformation, the $LL\_x(x,y)$ square matrix is partitioned into non-overlapping squares of size $R_{size} \times R_{size}$ named $R_j^{\phi}$, where $R_{size} \in \{2, 4, 8\}$, $j \in \{0, \ldots, (N / (2^{dl} \cdot R_{size}))^2\}$ and $dl$ is the number of the decomposition layers of the wavelet transform. The $LL\_x^{1}(x,y)$ square matrix is partitioned into non-overlapping squares of size $R_{size} \times R_{size}$ named $D_k^{\phi}$, with $k$ depending on the lattice spacing (1 or higher) and $R_{size}$. Each square ($R_j^{\phi}$ or $D_k^{\phi}$) is divided into upper left, upper right, lower left and lower right quadrants, numbered sequentially with index $i \in \{1, 2, 3, 4\}$. A domain block $D_k^{\phi}$ is entered in the possible matching list of a range block $R_j^{\phi}$ only if there is a rotation that can match the orderings of the four quadrants in the two blocks.

In the second step of the classification procedure the codebook size is adapted to an arbitrary number of vectors with a procedure that excludes the most unlikely matches. The block variance $\sigma^2$ is adopted as an appropriate metric that is invariant to the luminance transform features. The general idea is to sort the variances of all the domain blocks $D_k^{\phi}$ by value into lists. Then, for each range block, a specific window is selected within the list of domains, and only the domains in this window are included. The geometrical classification is cascaded with the luminance classification, because the previously described method for the codebook restriction is invariant to the rotations of the blocks. The result of this method is a codebook specifically constructed for each range, which is used for the minimization of the RMS error in the wavelet domain.

As already mentioned, the proposed algorithm has been developed with the aim of reducing the power consumption of wavelet-based image coders.
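The variance-based codebook restriction described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: blocks are represented as flat lists of coefficients, the window is centred on the position of the range block's variance in the sorted list, and the function names and the `window` parameter are hypothetical.

```python
import bisect
from statistics import pvariance


def sort_domains_by_variance(domain_blocks):
    """Return (variance, block index) pairs sorted by variance."""
    return sorted((pvariance(block), idx) for idx, block in enumerate(domain_blocks))


def candidate_domains(range_block, sorted_domains, window):
    """Indices of the domain blocks inside a variance window around the range block.

    The window size is an assumed tuning parameter; a smaller window gives
    a smaller per-range codebook and fewer comparisons.
    """
    variances = [var for var, _ in sorted_domains]
    pos = bisect.bisect_left(variances, pvariance(range_block))
    hi = min(len(sorted_domains), max(0, pos - window // 2) + window)
    lo = max(0, hi - window)  # keep the window full near the ends of the list
    return [idx for _, idx in sorted_domains[lo:hi]]


# Made-up 2x2 blocks flattened to four coefficients each.
domains = [
    [0, 0, 0, 0],
    [0, 2, 0, 2],
    [0, 4, 0, 4],
    [0, 10, 0, 10],
    [0, 20, 0, 20],
]
sorted_domains = sort_domains_by_variance(domains)
matches = candidate_domains([1, 5, 1, 5], sorted_domains, window=3)
```

Because the sorted list is built once and each range block only inspects a small slice of it, the per-range codebook stays small, which is consistent with the storage and transfer arguments made in the rest of this section.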
The main power related bottlenecks of wavelet-based coders are the large storage requirements, leading to extensive off-chip storage, and the large number of accesses to the image and coefficient data. This complexity increases as the number of wavelet decomposition levels is increased. It must be noted that, in order to achieve high compression, wavelet based coding schemes may require several decomposition levels. For example, the Efficient Pyramid Image Coder (EPIC, an experimental image compression system based on a biorthogonal critically-sampled dyadic wavelet decomposition and a combined run-length/Huffman entropy coder [13]) applies a five level wavelet decomposition in order to achieve the compression ratio desired by the targeted applications, while the approach proposed in [8] uses a three level wavelet decomposition. Another important disadvantage of existing wavelet coders is that they use filters with real coefficients. This increases not only the arithmetic computational complexity but also the power required for the storage and transfer of the computed wavelet coefficients.

The proposed algorithm, developed according to the methodology described in section 3, removes the above described bottlenecks and achieves lower power consumption in a number of different ways, described below:

• Storage related power
The proposed algorithm uses an integer wavelet transform that has no growth in the smooth output, i.e. the produced wavelet coefficients (in all levels) can be represented using 8 bits. This reduces the memory space required for the storage of the coefficients and thus the power consumed per access to the corresponding storage elements. The wavelet coefficients can be stored in exactly the same memory locations as the image data, overwriting coefficients of the previous levels or image data.
This can be achieved in existing wavelet based coders as well, but then more bits are required to store the real wavelet coefficients (EPIC, approach [8]), leading to larger power per access and worse memory space utilization. The encoding of the wavelet coefficients is based on a tree-based fractal scheme that generates different but small codebooks for each tree. All the codebooks can use the same memory space, since each is used for the encoding of one tree only. In this way the memory that would be used for the storage of the complete codebook, as in traditional vector quantization, is replaced by the arithmetic computation used for the generation of the codebook of each tree, favoring power reduction. It must be noted that the memory space requirements of the proposed tree-based fractal approach are close to those of the Pyramid Vector Quantization scheme used in [8].

• Interconnect related power
The proposed approach uses a wavelet transform that achieves energy compaction with a small number of wavelet decomposition levels. In this way the number of accesses to the large off-chip memories storing the wavelet coefficients is reduced, leading to a significant reduction in power consumption. Beyond the wavelet decomposition, the processing is mainly performed on the wavelet coefficients in the lower frequency band. The size of this data structure is small enough to allow on-chip storage, implicitly creating an extra level of memory hierarchy and thus reducing the power expensive off-chip data transfers. The remaining wavelet coefficients are only accessed during the tree-based fractal coding. The small size of the codebook used for the tree-based fractal coding leads to a reduced number of data transfers from the corresponding memory.

• Arithmetic computation related power
The small number of wavelet decomposition levels significantly reduces the number of arithmetic operations required (mainly by the filtering).
The use of an integer wavelet transform reduces the operand bitwidths throughout the algorithm (even in the tree based fractal coding) in comparison to existing wavelet-based coders. The use of a small number of codewords for the coding of each tree reduces the amount of arithmetic computation further.

5. EXPERIMENTAL RESULTS

Compared with JPEG and EPIC at the same PSNR, the proposed algorithm achieved an increased compression ratio without loss of perceptual image quality. In addition, because of the nature of the wavelet transform, the blocking effect that appears in block coders disappears. Figure 3 shows the compression ratio versus PSNR for the test image "man" of size 512x512. The proposed algorithm is clearly superior to the other two, and one can observe that its PSNR degrades more slowly as the compression ratio increases. It must also be noted that, as the picture size increases, the superiority of the algorithm over the other two algorithms under comparison becomes even stronger. In table 1 the proposed technique is compared to the one presented in [8] for a test image. The results are comparable.

[Figure 3: Comparison of results for image man (512x512). Plot of PSNR (dB) versus compression; Proposed: +, EPIC: o, JPEG: x.]

Table 1: Comparison of results between the proposed algorithm and Namgoong's approach

Compression Ratio   Proposed PSNR   Namgoong PSNR
0.15                24.38           24.48
0.20                25.30           25.14
0.25                26.56           26.28

The proposed algorithm is compared in terms of power with some typical image coding algorithms, namely: a) the JPEG standard for image compression, b) the EPIC wavelet based coder and c) the low power algorithm proposed in [8]. Power estimates are obtained through profiling of the algorithms, using the power models described in section 2. For the power estimation the supply voltage was always assumed to be 5 volts.
It is assumed that one frame (image) is stored in a memory of the target system. Storage related decisions assume custom hardware realizations with an area restriction of 50 Kbytes on-chip. On this basis it is decided whether the data structures present in the algorithms will be stored on-chip or off-chip. For the JPEG algorithm the power estimates have been obtained using the code obtained from [17]. For the estimation of the power consumption of EPIC, the code that is available from [13] has been used. This code is also optimized in terms of storage requirements and external bus load, mainly targeting implementations on programmable processors. The version using the 15-tap filters has been used. For the approach described in [8], a code using the 7-tap filters has been selected. It must be noted that in [8] no details are given for the storage organization of the image data and the wavelet coefficients. The code that has been generated for this approach has been highly optimized for power by aggressively applying techniques described in [1]. Finally, it must be noted that no aggressive specification refinement optimizations have been applied to the proposed algorithm. The power results are given in table 2.

Table 2: Comparison of the different algorithms in terms of power for images 512 × 512

Algorithm    EPIC    JPEG   Namgoong   Prop.
Power (mW)   15685   3797   1772       1322

From the results of table 2 it becomes clear that the proposed algorithm is much more power efficient than the EPIC wavelet coder, by 12-14 times. Furthermore, it is better than JPEG by approximately three times, and it even outperforms the highly optimized version of the low power algorithm described in [8] by 40%.

6. CONCLUSIONS

A methodology for the power conscious design of image and video processing algorithms has been presented. It is expected that the use of this methodology will significantly reduce the design time and will lead to better results, especially in terms of power.
The proposed methodology has been applied to the development of a high compression algorithm for image and intra-frame video coding. The algorithm has better performance and consumes far less power than existing state-of-the-art algorithms.

7. REFERENCES

[1] J. M. Rabaey, M. Pedram, "Low Power Design Methodologies", Kluwer Academic Publishers, 1995.
[2] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle, "Custom Memory Management Methodology - Exploration of Memory Organisation for Embedded Multimedia System Design", ISBN 0-7923-8288-9, Kluwer Academic Publishers, Boston, 1998.
[3] A. Kalavade, P. A. Subrahmanyam, "Hardware/Software Partitioning for Multifunction Systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, September 1998, pp. 819-837.
[4] D. Kirovski, M. Potkonjak, "System-Level Synthesis of Low-Power Hard Real-Time Systems", Proc. of Design Automation Conference 1997 (DAC'97), Anaheim, Cal.
[5] D. B. Lidsky, J. M. Rabaey, "Low-power design of memory intensive functions", 1994 IEEE Symposium on Low Power Electronics, digest of technical papers, pp. 16-17.
[6] M. Potkonjak, J. M. Rabaey, "Power Minimization in DSP Application Specific Systems Using Algorithm Selection", Proc. of the 1995 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 2639-2642.
[7] T. Meng, B. M. Gordon, E. Tsern, A. C. Hung, "Portable video-on-demand in wireless communication", Proceedings of the IEEE, special issue on Low-power design, vol. 83, no. 4, April 1995, pp. 659-680.
[8] W. Namgoong, T. H. Meng, "A low-power encoder for pyramid vector quantization of subband coefficients", Journal of VLSI Signal Processing Systems for Video, Image and Signal Processing, Kluwer Academic Publishers, vol. 16, April 1997, pp. 9-23.
[9] J. Rabaey, L. Guerra, R. Mehra, "Design guidance in the power dimension", Proc. of the 1995 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing.
[10] J. M.
Rabaey, "Exploring the Power Dimension", IEEE 1996 Custom Integrated Circuits Conf., pp. 215-220.
[11] D. Lidsky, J. M. Rabaey, "Early Power Exploration: A World Wide Web Application", Proceedings of 33rd Design Automation Conference, Las Vegas, 1996.
[12] V. Bhaskaran, K. Konstantinides, "Image and Video Compression Standards", Kluwer Academic Publishers, 1994.
[13] EPIC code
[14] D. Le Gall and A. Tabatabai, "Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques", IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, pp. 761–765, 1988.
[15] J. A. Robinson, "Efficient general-purpose image compression with binary tree predictive coding", submitted December 1994 for publication in IEEE Transactions on Image Processing.
[16] G. Davies, "A Wavelet-Based Analysis of Fractal Image Compression", IEEE Trans. Image Proc., vol. 7, no. 2, pp. 141-154, Feb. 1998.
[17] JPEG source code

[Figure 1: Methodology for power conscious algorithm conception. Flow: performance evaluation of existing algorithms meeting the performance requirements (inputs: profiling, high level power models, high level architectural related decisions, area constraints or restrictions from the target architecture); identify power bottlenecks and related tasks; remove power related bottlenecks; evaluate power; outputs: new power efficient algorithm and architectural restrictions.]

[Figure 2: Approach for removing power related bottlenecks. Tasks causing power related bottlenecks and existing algorithmic kernels for the given task feed the design of a new, more power efficient task, which is checked by simulation. Stage 1, removing storage related bottlenecks: directly reduce storage requirements (both in word and in bit level); replace memory by less expensive arithmetic computation. Stage 2, removing transfer related bottlenecks: directly reduce the number of transfers; increase locality by focusing on a small part of the data; reduce the bitwidth of frequently transferred data. Stage 3, removing computational related bottlenecks: reduce the number of arithmetic operations; reduce the strength of complex arithmetic operations; reduce the operands' bitwidths.]