# A Novel Low-Complexity Energy-Efficient DCT Architecture

**Neha Bhargava<sup>1</sup>**, **Shweta Agrawal<sup>2</sup>** <sup>1</sup>Research scholar,<sup>2</sup> Assistant Professor, Dept of Electronics and Comm., SRCEM Banmore, Morena, India.

Abstract— The discrete cosine transform (DCT) is the prime component of the image compression unit employed in most portable devices that run multimedia applications, and it is the most compute-intensive unit in image compression standards. The performance of the DCT therefore significantly affects the performance of the application. This paper presents a novel energy-efficient DCT architecture that exploits several complexity reduction techniques driven by the non-significance of coefficients. The proposed DCT architecture provides energy scalability by exploiting boundary error resiliency. This architecture provides a significant improvement in energy efficiency without degradation in image quality. The proposed and existing designs are implemented and simulated with benchmark inputs to evaluate the efficacy of the proposed design over the existing architectures. Simulation results show that the proposed DCT architecture requires 10.3% less area and 52.1% less energy than the existing DCT architecture.

*Keywords*— Image Processing, Approximate Architectures, Energy Efficiency, Integrated Circuits, VLSI.

## I. INTRODUCTION

With the explosive growth of multimedia applications on portable devices, energy-efficient image/video compression techniques are becoming more important. Significant research has been devoted to image compression standards that provide higher compression with a small loss of quality in the reconstructed image/video. Large energy consumption in these portable devices increases the failure probability, which may result in device failure. For example, large image/video data is stored in compressed form, which requires image/video compression, and the discrete cosine transform (DCT) is the prime computing block within image compression. The high energy consumption of the DCT limits the performance of image compression. An energy-efficient DCT is therefore a prerequisite for portable devices because of their limited battery size.

Over the last few decades, energy-efficient design was achieved by scaling device dimensions, but this approach has now stalled because increased process and other variations significantly degrade device performance. Hence, scaling the device dimensions alone fails to achieve an energy-efficient design. Therefore, to design power- and performance-efficient VLSI architectures, an architectural or algorithmic level approach is required that can exploit the properties of the application. This results in an optimized architecture and improved quality. Aggressive scaling has pushed device dimensions into the deep nanometer regime, where process variation is becoming much more severe. Designs that do not consider process variations will fail to provide the desired output. Further, additional circuitry to mitigate the effect of process variation is very costly in terms of power, area and delay, such that the gain due to scaling is smaller than the overhead. Therefore, another design methodology is required to develop designs for modern gadgets. There are several applications, such as image/video processing, where approximate results are acceptable. This relaxation in accuracy can be used to improve the design parameters.

Significant research effort has been devoted to efficient DCT architectures, using techniques such as partial computation, computation sharing, adaptive bit-width architectures and voltage over-scaling (VOS). An area-efficient DCT architecture based on computation sharing multiplication (CSM) has been proposed that reduces implementation complexity. Similarly, another area- and power-efficient DCT architecture reduces the bit-width of the high-frequency computations, as these computations contribute little to the overall quality. This architecture therefore provides good design metrics with small quality degradation. An energy-efficient DCT architecture can also be obtained by applying VOS to the non-significant coefficients, which may cause timing errors, but only errors of small magnitude. Applying VOS to the significant coefficients for further power reduction may cause a large degradation in quality and is therefore avoided. A new architecture that reduces timing errors, in particular errors that may be amplified at a later stage and errors of high amplitude, is presented in []. Further, modern VLSI designs suffer from process and other variations, which severely degrade the performance of the DCT. A process-variation-aware DCT architecture [] has been proposed in which the significant computation paths are implemented with smaller delay than the non-significant computation paths. Therefore, under process variations, only the non-significant coefficients are affected, which introduces only a small error and yields an image of acceptable quality.

The rest of the paper is organized as follows. Section II provides a literature review of the different DCT architectures. The simulation environment and the analysis of the results of the proposed DCT against existing designs are discussed in Section III. Finally, the conclusion is given in Section IV.

## II. LITERATURE REVIEW ON DCT

This section reviews different techniques for achieving an energy-efficient DCT architecture.

#### 2.1 DCT Principle and Its Property

The DCT is a mathematical operation similar to the Fourier transform that converts a signal from the spatial domain to the frequency domain [1]. It can be viewed as a Fourier-type transform in which only the cosine terms are retained while the sine terms are omitted. The mathematical expression for the 2D DCT is given below.

$$Y(k_1, k_2) = \frac{2}{N}\,c(k_1)\,c(k_2)\sum_{n_1=0}^{N-1}\sum_{n_2=0}^{N-1}X_{n_1,n_2}\cos\frac{(2n_1+1)k_1\pi}{2N}\cos\frac{(2n_2+1)k_2\pi}{2N}$$

where $k_1, k_2 = 0, 1, \ldots, N-1$, $c(0) = 1/\sqrt{2}$, and $c(k) = 1$ for $k \neq 0$. This can be represented in a matrix form.

Fig. 1 shows the architecture of a 2D-DCT consisting of two 1D-DCTs. As the figure shows, the 2D-DCT coefficients are obtained by feeding the transposed output of the first 1D-DCT as input to the second 1D-DCT. Therefore, only a single 1D-DCT architecture has to be designed, which can then be reused to compute the 2D-DCT. However, this approach requires memory to perform the transpose operation. Most existing DCT architectures employ this topology to achieve the 2D-DCT, so most research efforts try to reduce the complexity of a single 1D-DCT.
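The row-column decomposition described above can be sketched in software as follows. This is a minimal illustration rather than the hardware architecture of Fig. 1: the function names `dct_1d`, `transpose` and `dct_2d` and the constant test block are assumptions introduced here, and the 1D transform uses the orthonormal scaling that matches the 2D expression given earlier.

```python
import math

def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence x."""
    n = len(x)
    out = []
    for k in range(n):
        c = math.sqrt(0.5) if k == 0 else 1.0
        s = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
        out.append(math.sqrt(2.0 / n) * c * s)
    return out

def transpose(m):
    """Swap rows and columns (the role of the transpose memory)."""
    return [list(row) for row in zip(*m)]

def dct_2d(block):
    """2D DCT built from one reused 1D DCT: rows, transpose, rows, transpose."""
    stage1 = [dct_1d(row) for row in block]
    stage2 = [dct_1d(row) for row in transpose(stage1)]
    return transpose(stage2)

# Hypothetical test input: a constant 8x8 block has only a DC coefficient.
block = [[10.0] * 8 for _ in range(8)]
coeffs = dct_2d(block)
print(round(coeffs[0][0], 6))
```

For a constant block every coefficient except `coeffs[0][0]` is zero up to rounding, which is a quick sanity check that the two 1D passes together implement the 2D transform.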



The high energy compaction of the DCT allows values in the spatial domain to be represented by a few coefficients in the DCT domain. If four coefficients of the DCT and of the FFT are truncated and the inverse operation is applied to recover the original signal, the signal recovered from the DCT is much closer to the original than the signal recovered from the FFT. The DCT therefore exhibits much higher energy compaction than the FFT, which is why it is used in image compression standards.
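Energy compaction can be illustrated numerically with a short sketch. It covers only the DCT side of the comparison and does not reproduce the FFT experiment mentioned above. The ramp input and the helper name `energy_fraction` are assumptions chosen for illustration.

```python
import math

def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence x."""
    n = len(x)
    out = []
    for k in range(n):
        c = math.sqrt(0.5) if k == 0 else 1.0
        s = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
        out.append(math.sqrt(2.0 / n) * c * s)
    return out

# Hypothetical smooth input: an 8-sample ramp.
signal = [float(i) for i in range(8)]
coeffs = dct_1d(signal)

# With orthonormal scaling the transform preserves total energy.
total_energy = sum(c * c for c in coeffs)

def energy_fraction(m):
    """Fraction of the total energy held by the first m coefficients."""
    return sum(c * c for c in coeffs[:m]) / total_energy

for m in (1, 2, 4):
    print(m, round(energy_fraction(m), 4))
```

For this smooth input almost all of the energy sits in the first two coefficients, which is the property that lets a codec discard or coarsely approximate the remaining ones.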

#### 2.2 CSM Architecture

The CSM architecture provides a good quality-energy trade-off by sharing multiplier computations in the DCT. The DCT hardware is implemented using distributed arithmetic based computations. The authors modified the basic DCT matrix such that some constant multiplications become common between different expressions and can therefore be shared. The approach can also be used effectively in other applications that employ constant multiplications. For example, the DCT coefficients a, b, c, d, e, f and g are presented with their accurate values and their proposed modified values. Each 8-bit multiplication is divided into two 4-bit multiplications, and the common multiplications are then shared. Moreover, to reduce the number of different multipliers required, the authors approximated the original multipliers such that a very small error yields a significant reduction in implementation complexity.
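The idea of splitting an 8-bit constant multiplication into two 4-bit parts that draw on shared partial products can be sketched as below. This is an illustrative software model under stated assumptions, not the cited hardware design: the shared bank of odd multiples, the function names, and the coefficient values are all hypothetical.

```python
def precompute_alphabets(x):
    """Shared bank of odd multiples 1x, 3x, ..., 15x, computed once per input.

    In hardware these would themselves be built from shifts and adds.
    """
    return {a: a * x for a in range(1, 16, 2)}

def nibble_product(nibble, bank):
    """Multiply x by a 4-bit value using only a bank lookup and a shift."""
    if nibble == 0:
        return 0
    shift = 0
    while nibble % 2 == 0:      # factor the nibble as odd * 2**shift
        nibble //= 2
        shift += 1
    return bank[nibble] << shift

def csm_multiply(coeff, bank):
    """8-bit constant times x as two 4-bit products that share one bank."""
    hi, lo = coeff >> 4, coeff & 0xF
    return (nibble_product(hi, bank) << 4) + nibble_product(lo, bank)

# Hypothetical 8-bit coefficients, all multiplied by the same input sample.
coefficients = [91, 118, 49]
x = 13
bank = precompute_alphabets(x)
products = [csm_multiply(c, bank) for c in coefficients]
print(products)
```

The bank is computed once per input sample and reused by every coefficient, so each additional constant multiplication costs only selections, shifts and one addition.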

#### 2.3 Bit-Truncation Technique

Park et al. [7] presented a low-complexity reconfigurable DCT design that reduces power consumption. The complexity of the DCT is reduced by reducing the bit-width of the coefficient multipliers. For example, a few least significant bits (LSBs) of the multipliers of the least significant coefficients are truncated. This LSB truncation reduces implementation complexity and significantly improves power and performance without much degradation in output quality. Further, the authors presented a reconfigurable architecture in which the bit-width can be varied at run time, so that a quality-energy trade-off can be achieved.

| DCT bases              | Number of non-zero digits | Recoverable coefficient entry |
|------------------------|---------------------------|-------------------------------|
| (a) Original DCT bases | 20                        |                               |
| (b) Trade-off Level 1  | 16                        |                               |
| (c) Trade-off Level 2  | 13                        |                               |
| Trade-off Level 3      |                           | a = 0 1 0 0 0 0 0 0           |
| Trade-off Level 4      |                           | a = 0 1 0 0 0 0 0 0           |
| Trade-off Level 5      |                           | a = 0 1 0 0 0 0 0 0           |

Fig. 2: Constant multiplier approximation

Fig. 2 shows that the authors try to reduce the number of non-zero bits in the constant multipliers, which reduces the complexity. The number of non-zero terms is 20 in the original multipliers and is reduced to 6 at level 5, as shown in the figure. With each new trade-off level, the complexity reduces and the power and performance improve at the cost of a small quality degradation.
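In a shift-and-add constant multiplier, each non-zero digit of the constant contributes one shifted addend, so fewer non-zero digits means fewer adders. The sketch below shows how LSB truncation lowers that count. The coefficient values are hypothetical, plain binary is assumed (the figure may use a different digit representation), and the levels here are simply the number of cleared LSBs rather than the trade-off levels of Fig. 2.

```python
def nonzero_bits(value):
    """Number of 1 digits in the binary form of an unsigned constant."""
    return bin(value).count("1")

def truncate_lsbs(value, k):
    """Clear the k least significant bits of an unsigned constant."""
    return (value >> k) << k

# Hypothetical 8-bit constant multipliers.
coefficients = [0b01011011, 0b01110110, 0b00110001]

# Total non-zero digits after clearing 0, 1, 2, 3 and 4 LSBs.
totals = [
    sum(nonzero_bits(truncate_lsbs(c, k)) for c in coefficients)
    for k in range(5)
]
print(totals)
```

The totals never increase as more LSBs are cleared, which mirrors the trend in the figure: each step trades a small coefficient error for fewer addends.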

#### 2.4 Process Aware DCT Architecture

Banerjee et al. [10] presented a process-variation-tolerant DCT architecture that reduces power consumption by applying aggressive voltage scaling. The architecture exploits the property of the DCT that not all computations contribute equally to the output quality. Therefore, the computation paths that contribute less to the quality (non-significant) are designed with longer delay than the significant computation paths, which are designed with smaller delay. In an 8-point DCT, the low-frequency coefficients are more significant than the high-frequency coefficients. Therefore, the higher-frequency coefficients are designed to be slower.



Fig. 3: Input image matrix and DCT matrices after 1<sup>st</sup> and 2<sup>nd</sup> transform.

Fig. 3 shows that the authors compute the first five coefficients faster than the last three non-significant coefficients. After the 2D-DCT transformation, the resulting DCT matrix contains 25 significant coefficients, which are computed faster than the remaining non-significant ones. Under process and other variations, only the non-significant coefficients show timing errors, while the significant coefficients are computed without any timing error. Thus the design provides process variation tolerance, as these variations produce only a very small error, whereas in conventional designs the significant coefficients also show large errors, which degrades the quality severely.
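The count of 25 significant coefficients follows from applying the five-of-eight split along both dimensions. A minimal sketch, assuming the significant region is the top-left 5x5 corner of the 8x8 coefficient matrix (the names `N`, `SIGNIFICANT_1D` and `mask` are introduced here for illustration):

```python
N = 8               # 8-point DCT, so an 8x8 coefficient matrix
SIGNIFICANT_1D = 5  # first five coefficients per dimension, as stated in the text

# True where a 2D coefficient lies on a fast (significant) path in both passes.
mask = [[r < SIGNIFICANT_1D and c < SIGNIFICANT_1D for c in range(N)]
        for r in range(N)]

significant = sum(cell for row in mask for cell in row)
non_significant = N * N - significant
print(significant, non_significant)
```

Under this assumption 25 of the 64 coefficients are protected from timing errors, and the remaining 39 are the ones allowed to fail under variation.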

#### 2.5 Timing Error Acceptance DCT (TEA-DCT)

He et al. [9] presented an approach for the early detection of timing errors caused by voltage over-scaling, together with a corresponding mitigation technique. The authors implemented this technique in the DCT and demonstrated a low-energy inverse DCT architecture with controlled timing error. The paper presents a strategy to control large timing errors and timing errors that may be amplified at a later stage. The control logic exploits knowledge of the operand statistics and rearranges the operations such that errors cancel out or diminish rather than grow.

The dynamic reordering of the additions reduces the possibility of timing errors [10]. The authors showed that the number of errors is reduced if the number of additions of a positive number with a negative number is reduced. This can be achieved by first adding all positive numbers together and all negative numbers together, and only then adding the positive sum to the negative sum. Consider four numbers (-1, 1, -1, 1) that have to be added, each represented with 16 bits. Computing (-1 + 1) + (-1 + 1) exhibits two long carry propagations, whereas computing (-1 + -1) + (1 + 1) exhibits only a single long carry propagation. Moreover, that carry propagates over a shorter distance than in the first approach. In this way the authors reduce the timing error.
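The carry behaviour in the (-1, 1, -1, 1) example can be checked with a small model of a 16-bit ripple addition. This is a sketch under an assumed definition of chain length (the number of consecutive bit positions a generated carry travels through), and the function name `longest_carry_chain` is introduced here for illustration.

```python
def longest_carry_chain(a, b, width=16):
    """Longest run of bit positions a carry travels when adding a + b."""
    mask = (1 << width) - 1
    a &= mask                       # two's complement view of negative inputs
    b &= mask
    carry, run, best = 0, 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                 # generate: a new carry starts here
            carry, run = 1, 1
        elif (ai ^ bi) and carry:   # propagate: incoming carry passes through
            run += 1
        else:                       # kill, or no carry to propagate
            carry, run = 0, 0
        best = max(best, run)
    return best

# Order 1: (-1 + 1) + (-1 + 1)
mixed = [longest_carry_chain(-1, 1),
         longest_carry_chain(-1, 1),
         longest_carry_chain(0, 0)]

# Order 2: (-1 + -1) + (1 + 1)
grouped = [longest_carry_chain(-1, -1),
           longest_carry_chain(1, 1),
           longest_carry_chain(-2, 2)]

print(mixed, grouped)
```

In this model the first ordering contains two full-width carry chains, while the second contains a single long chain that is one position shorter, which agrees with the argument in the text.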



Fig. 4: Partitioning of the input matrix.

This approach also partitions the input matrix and uses operators of different sizes based on the significance of the coefficients. Moreover, the 2D-DCT is computed with two 1D-DCT operations, so an error that occurs in the first stage is amplified in the second stage. Therefore, the authors allow no error in the first stage. Energy/power efficiency is achieved by supply voltage scaling, which introduces timing errors. The authors implemented an IDCT architecture and showed the efficacy of their technique over existing architectures.

#### 2.6 Dynamically Reconfigurable DCT Architecture

Jiang et al. [11] presented a dynamically reconfigurable DCT architecture that can be reconfigured to trade off quality (bit-rate) against dynamic power consumption. The authors realized an optimal architecture whose performance is measured in terms of output image quality, bit-rate and dynamic power consumption. To design this architecture, the authors varied the number of non-zero DCT coefficients and the quality factor of the quantization table. A dynamic partial reconfiguration (DR) controller is designed to meet the bit-rate and dynamic power constraints. The resulting architecture, shown in Fig. 5, indicates that the DR controller is the key component of the reconfigurable DCT architecture.



Fig. 5: Dynamic reconfigurable DCT architecture

In this DCT architecture, a constraint such as the bit-rate is the input used to control the complexity of the DCT. The number of significant coefficients to be computed is determined from the constraints. Each 8x8 image sub-block is processed, and its DCT is evaluated and passed to the encoder. The encoder encodes the DCT coefficients using a run-length coding scheme. The design was implemented and simulated with benchmark images and showed a significant improvement over existing architectures. This architecture provides reconfigurability to achieve the desired quality-energy trade-off. It also exhibits significantly reduced implementation complexity compared with previous architectures.
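Run-length coding benefits directly from computing fewer coefficients, because the skipped coefficients become runs of zeros. The sketch below uses a generic zero-run scheme with an end-of-block marker. It is not the encoder of the cited architecture, and the function names, the `"EOB"` marker and the sample coefficient sequence are assumptions for illustration.

```python
def run_length_encode(coeffs):
    """Encode a sequence as (zero_run, value) pairs followed by an end marker."""
    pairs = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    pairs.append("EOB")             # trailing zeros are implied by the marker
    return pairs

def run_length_decode(pairs, length):
    """Rebuild the original sequence of the given length."""
    out = []
    for item in pairs:
        if item == "EOB":
            break
        run, value = item
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (length - len(out)))
    return out

# Hypothetical quantized coefficients in scan order.
sequence = [52, 0, 0, -3, 1, 0, 0, 0]
encoded = run_length_encode(sequence)
decoded = run_length_decode(encoded, len(sequence))
print(encoded)
```

Keeping only the first few coefficients of a block lengthens the trailing zero run, which the end marker absorbs at no extra cost, so a tighter bit-rate constraint maps naturally to computing fewer coefficients.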

## III. EXPERIMENTAL RESULT & ANALYSIS

The existing and proposed DCT architectures are implemented in Tanner and MATLAB. Tanner is used to compute the design metrics, whereas MATLAB is used to compute the error metrics. To simulate a design in Tanner, its schematic is first drawn in Tanner's schematic editor and the net-list of the circuit is extracted from the schematic. The net-lists are then simulated on a SPICE simulator with the 45nm Predictive Technology Model file to obtain the design metrics. Area, power and delay are computed for each design in the simulation.

#### 3.1 Simulation Results on Tanner

The proposed and existing DCT architectures are implemented in Tanner to evaluate the design metrics. The schematic of the proposed DCT architecture is shown in Fig. 6.



Fig. 6: Proposed approximate DCT on Tanner

The schematics of the bit-width aware DCT (BWA-DCT) and the TEA-DCT are shown in Fig. 7 and Fig. 8, respectively. Both exhibit a similar architecture, except that the BWA-DCT uses a different operation bit-width for each coefficient, whereas the TEA-DCT contains additional logic to detect and correct errors.



Fig. 7: Schematic of BWA-DCT on Tanner



Fig. 8: Schematic of TEA-DCT on Tanner

#### 3.2 Simulation Results on MATLAB

The error metrics are evaluated by modelling the proposed and existing designs in MATLAB and simulating them with benchmark input images. The peak signal-to-noise ratio (PSNR) is computed and compared in Table 1. The PSNR in decibels is given by the equation below.

$$PSNR_{dB} = 10\log_{10}\left(\frac{Sig_I^2}{MSE}\right) \tag{1}$$

where $Sig_I$ is the maximum signal value, which is 255 for an 8-bit image, and $MSE$ is the mean squared error between the original and the reconstructed image.
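Equation (1) can be evaluated with a few lines of code. The paper computes it in MATLAB; the sketch below uses Python instead and treats an image as a flat list of pixel values, with the function names `mse` and `psnr_db` introduced here for illustration.

```python
import math

def mse(original, reconstructed):
    """Mean squared error between two equally sized pixel sequences."""
    n = len(original)
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n

def psnr_db(original, reconstructed, peak=255):
    """PSNR in dB as in Eq. (1); peak is the maximum signal value."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf             # identical images
    return 10 * math.log10(peak ** 2 / err)

# Hypothetical 4-pixel example: every pixel is off by one grey level.
original = [10, 20, 30, 40]
reconstructed = [11, 21, 31, 41]
print(round(psnr_db(original, reconstructed), 4))
```

A uniform error of one grey level gives an MSE of 1 and a PSNR of about 48.13 dB, which gives a sense of scale for the values reported in Table 1.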

| ET | TEA-DCT  | BWA-DCT  | Proposed |
|----|----------|----------|----------|
| 1  | 28.10013 | 14.20352 | 28.08197 |
| 2  | 27.33193 | 14.03673 | 28.05133 |
| 3  | 24.99839 | 13.63131 | 27.98613 |
| 4  | 22.37093 | 13.19102 | 27.7231  |
| 5  | 19.36582 | 12.46905 | 27.29903 |
| 6  | 15.40877 | 12.2634  | 26.67023 |
| 7  | 14.10823 | 12.08157 | 26.04854 |
| 8  | 14.10823 | 11.83859 | 25.4486  |

TABLE 1: PSNR (dB) FOR VARIOUS DCT ARCHITECTURES FOR VARYING ERROR TOLERANCE (ET).

Fig. 10 compares the PSNR of the different DCT architectures. According to Table 1, the proposed DCT exhibits a higher PSNR than the existing designs for ET of 2 and above, and is within 0.02 dB of the TEA-DCT at ET = 1. Fig. 11 shows the reconstructed images of the different DCT architectures for varying values of error tolerance. The figure shows that the proposed DCT architecture provides images of higher quality than the existing designs.





## IV. CONCLUSION

The DCT is one of the most frequent operations in several image/video processing applications such as image compression, so a high-performance, energy-efficient DCT architecture is required. This paper explored existing energy-efficient DCT architectures, which were evaluated and compared. It is observed that most existing designs exploit the relative significance of the DCT coefficients to apply different degrees of approximation. These designs provide different trade-offs between design metrics and quality metrics.

## REFERENCES

[1]. N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," *Computers, IEEE Transactions on*, vol. C-23, no. 1, pp. 90–93, (1974).

[2]. J. Han and M. Orshansky, "Approximate computing: An emerging paradigm for energy-efficient design," in *Test Symposium (ETS)*, 2013 18th IEEE European, (2013), pp. 1–6.

[3]. S. Ghosh and K. Roy, "Parameter variation tolerance and error resiliency: New design paradigm for the nanoscale era," *Proceedings of the IEEE*, vol. 98, no. 10, pp. 1718–1751, (2010).

[4]. L. N. Chakrapani, K. K. Muntimadugu, L. Avinash, J. George, and K. V. Palem, "Highly energy and performance efficient embedded computing through approximately correct arithmetic: a mathematical foundation and preliminary exp. validation," *CASES*, pp. 187–196, (2008).

[5]. S. Mukhopadhyay, K. Kang, H. Mahmoodi, and K. Roy, "Reliable and self-repairing SRAM in nano-scale technologies using leakage and delay monitoring," in *Test Conference, 2005. Proceedings. ITC 2005. IEEE International*, (2005), p. 10–pp.

[6]. S. Kwon, J. Park, and K. Roy, "DCT processor architecture based on computation sharing," in *Circuit and Syst. for Comm.*, 2002. Proc. ICCSC'02. IEEE Int. Conf. on, (2002), pp. 162–165.

[7]. J. Park, J. H. Choi, and K. Roy, "Dynamic bit-width adaptation in DCT: An approach to trade off image quality and computation energy," *VLSI, IEEE Tran on*, vol. 18, no. 5, pp. 787–793, (2010).

[8]. D. Mohapatra, V. K. Chippa, A. Raghunathan, and K. Roy, "Design of voltage-scalable meta-functions for approximate computing.," in *DATE*, (2011), pp. 950–955.

[9]. K. He, A. Gerstlauer, and M. Orshansky, "Controlled timing-error acceptance for low energy IDCT design," in *Design*, *Automation Test in Europe Conference Exhibition (DATE)*, 2011, pp. 1–6.

[10]. K. He, A. Gerstlauer, and M. Orshansky, "Circuit-Level Timing-Error Acceptance for Design of Energy-Efficient DCT/IDCT-Based Systems," *CAS-VT, IEEE Tran. on*, vol. 23, no. 6, pp. 961–974, Jun. (2013).

[11]. Y. Jiang and M. Pattichis, "Dynamically reconfigurable DCT architectures based on bitrate, power, and image quality considerations," *ICIP*, 2012 19th IEEE Int. Conf. on, (2012), pp. 2465–2468.

[12]. Tutorial from Cambridge in Color: http://www.cambridgeincolour.com/tutorials/cameras-vs-human-eye.htm.

[13]. Various benchmark inputs for image processing, available at: http://sipi.usc.edu/database/.