Transform Processing on a Reconfigurable Data Path Processor

7th NASA Symposium on VLSI Design 1998  7.4.1

K. Joe Hass (jhass@mrc.unm.edu), David F. Cox (dcox@mrc.unm.edu)

NASA Institute of Advanced Microelectronics
Microelectronics Research Center
University of New Mexico
801 University Blvd. SE, Suite 206
Albuquerque, New Mexico 87106

Abstract – A Reconfigurable Data Path Processor (RDPP) is capable of high performance transform processing. The Fast Fourier Transform is used to investigate how the RDPP architecture influences its ability to execute these algorithms. Directed graphs are used to visualize the algorithm, which is then implemented in the unique programming language of the RDPP.

1 Introduction

Previous efforts resulted in the development of a reconfigurable data path processor, consisting of a number of identical data path elements interconnected with a wide crossbar bus. Each data path element is capable of performing simple logic functions as well as a multiply-accumulate (MAC) step, making the data path processor well suited to digital signal processing applications. This paper describes the implementation of transform algorithms, such as the Fast Fourier Transform (FFT), on the reconfigurable processor. Conventional software descriptions of these algorithms are converted to directed graphs in order to visualize and explore the flow of data. These graphs then suggest possible configurations of the reconfigurable processor. A behavioral VHDL model of the processor is used to evaluate architectural tradeoffs, while gate level models of a data path element were created to examine design issues at a lower level. These models have suggested alternate configurations of the processor that are better suited to transform algorithms. A comparison of these architectures is presented, along with estimates of processing performance for several algorithms of interest.
This research was supported by NASA under Space Engineering Research Grant NAGW-3293.

2 The Fast Fourier Transform

The Fourier Transform is used to convert some function of time, $h(t)$, to its equivalent representation, $H(f)$, in the frequency domain. In practice, the transform is typically performed on a set of data points, $h_n$, obtained by sampling $h(t)$ at equally spaced time intervals. The Discrete Fourier Transform, or DFT, is used in this case:

$$H_n \equiv \sum_{k=0}^{N-1} h_k \, e^{j 2 \pi k n / N}$$

where $N$ is the number of samples.

The DFT transforms $N$ complex numbers (the sampled data points) into $N$ complex numbers that represent the same data in the frequency domain. Each $H_n$ computation requires the summation of $N$ product terms, so the complete transform requires $O(N^2)$ operations. For large sample sizes this brute-force calculation is not practical.

More efficient methods for calculating the DFT have been investigated for nearly 200 years, starting with Gauss in 1805. In 1942, Danielson and Lanczos showed that a DFT of length $N$ can be performed by adding the results of two DFTs of length $N/2$, where one of the DFTs uses the even numbered elements from the original data set and the other DFT uses the odd numbered elements. This can be repeated, dividing the original data set into smaller and smaller pieces until the DFT of only two samples is performed. This concept was popularized in the context of electronic computers in 1965 by J. W. Cooley and J. W. Tukey and is commonly known as the Cooley-Tukey Fast Fourier Transform.

The basic building block of the FFT is the butterfly operator, shown in Figure 1. It performs these two computations:

$$H(0) = h(0) + h(1) \tag{1}$$

$$H(1) = W^k \, (h(0) - h(1)) \tag{2}$$

Since all of the input and output values are complex numbers, this butterfly actually requires ten real operations: three additions, three subtractions and four multiplications.
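As a rough illustration of where the ten real operations come from, the butterfly of equations (1) and (2) can be written out with explicit real and imaginary parts. This is only a sketch in Python; the function name `radix2_butterfly` and the use of (real, imaginary) tuples are assumptions made for the example, and the twiddle factor is simply passed in as a precomputed value.

```python
def radix2_butterfly(h0, h1, w):
    """Radix-2 butterfly on (real, imag) tuples using only real arithmetic.

    Computes H(0) = h(0) + h(1) and H(1) = W * (h(0) - h(1)).
    The tuple representation is an assumption of this sketch.
    """
    x0, y0 = h0
    x1, y1 = h1
    wr, wi = w

    # H(0) = h(0) + h(1): two real additions
    big_x0 = x0 + x1
    big_y0 = y0 + y1

    # h(0) - h(1): two real subtractions
    dr = x0 - x1
    di = y0 - y1

    # Complex multiply by W: four multiplications,
    # one more subtraction and one more addition
    big_x1 = dr * wr - di * wi
    big_y1 = dr * wi + di * wr

    # Totals: 3 additions, 3 subtractions, 4 multiplications
    return (big_x0, big_y0), (big_x1, big_y1)
```

For example, with h(0) = 1 + 2j, h(1) = 3 - j and W = -j, the sketch returns H(0) = 4 + j and H(1) = 3 + 2j.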
The $W^k$ 'twiddle factors' are themselves complex numbers of the form

$$W^k = e^{-j 2 \pi k / N} = \cos(2 \pi k / N) - j \sin(2 \pi k / N)$$

The butterfly is usually drawn without explicitly showing the addition or subtraction, and it is understood that whenever two or more arrows meet some addition or subtraction is implied.

Figure 1: Radix-2 FFT Butterfly

The butterfly operator can be used to compute the FFT of any set of data points where $N$ is of the form $N = 2^m$. Figure 2 illustrates this process for $N = 8$. In the first stage, the butterfly operates on data points h(0) and h(4), then on points h(1) and h(5), and so on until the data is exhausted. The second stage processes the results from the first stage, now working with data points that are only two locations apart. Finally, adjacent data elements are combined in the third stage to produce the transform outputs, H(n).

Figure 2: FFT for 8 data points

The butterfly operator discussed above is called a radix-2 butterfly because it processes two data elements in parallel. Butterfly operators for higher radices can be derived, such as the radix-4 butterfly shown in Figure 3. The radix-4 butterfly replaces four radix-2 butterflies, reducing the number of multiplies by 25%. However, each transform result is based on a sum and difference term derived from all four inputs, so the number of additions is only reduced by 8.3%. Of particular significance for hardware FFT processors is the fact that an FFT based on the radix-4 butterfly needs only $m = \log_4 N$ stages, which can reduce the processor clock rate and data memory bandwidth requirements. Radix-8 and radix-16 butterflies have also been used, although they are more complicated and require a multiplication inside the butterfly as well as the twiddle factor scaling at the outputs.
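The staged radix-2 process described for Figure 2 can be sketched in Python as follows. This is a minimal illustration, not the RDPP implementation: the names `fft_dif_radix2` and `dft_direct` are hypothetical, and the sketch assumes the negative-exponent convention of the twiddle factors, $W^k = e^{-j 2 \pi k / N}$ (with the opposite sign convention the twiddle factors would be conjugated). As in the figure, the first stage pairs elements that are N/2 locations apart, and the results emerge in bit-reversed order, which the last loop undoes.

```python
import cmath


def dft_direct(h):
    """Brute-force O(N^2) DFT, used only as a reference."""
    n = len(h)
    return [
        sum(h[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]


def fft_dif_radix2(h):
    """Radix-2 FFT built from the butterfly H0 = a + b, H1 = W^k (a - b)."""
    n = len(h)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(h)

    span = n // 2  # distance between the two butterfly inputs
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                # Twiddle factor for a sub-transform of length 2 * span
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a = data[start + k]
                b = data[start + k + span]
                data[start + k] = a + b
                data[start + k + span] = w * (a - b)
        span //= 2

    # Stage outputs are in bit-reversed order; restore natural order
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = data[i]
    return out
```

For N = 8 the outer loop runs three times (span = 4, 2, 1), matching the three stages in Figure 2, and the result should agree with `dft_direct` to within rounding error.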
Figure 3: Radix-4 FFT Butterfly

Butterflies with different radices can be combined for any given value of $N$. For example, a 1024-point FFT can be performed using two passes with a radix-16 butterfly followed by a single pass with a radix-4 butterfly because 1024 = 16 × 16 × 4. If the data rate for this FFT was $1 \times 10^6$ samples per second, then the bandwidth for the data into the FFT processor must be three times as high, or $3 \times 10^6$ complex data words per second, in order to process data sets as fast as they arrive. A processor that had only a radix-2 butterfly would need ten passes, with a data bandwidth of $10 \times 10^6$ complex words per second, to process this data in real time.

3 The FFT in a Programmable Parallel Processor

The butterfly operator represents a much different computational problem than typical FIR and IIR digital filters. The FFT algorithm is not a simple recursive multiply-and-accumulate process, it processes data in relatively small blocks, and the fact that all data values are complex numbers doesn't make matters any easier. However, there is a great deal of parallelism that can be exploited if appropriate hardware resources are available. This can be seen in the directed graph of the radix-4 algorithm shown in Figure 4, which was adapted from a FORTRAN program. The graph can be used to visualize how this algorithm could be mapped into a highly parallel hardware implementation.

The column of triangles in the center of the graph represents the data inputs. Input values x(0) through x(3) are the real parts of the data points while y(0) through y(3) are the imaginary parts. In other words, h(n) = x(n) + j y(n). Similarly, cos(a) through cos(c) are the real parts of the complex twiddle factors while sin(a) through sin(c) are the imaginary parts. Ellipses represent an intermediate computation and rectangles represent a computation that produces an output value.
Processing proceeds from the top of the graph to the bottom, with each row corresponding to a sequential time step in the algorithm.

A reconfigurable data path architecture that may be well suited to FFT algorithms has been described. This architecture was later modified slightly to improve the utilization of its data path elements. Basically, the data path consists of a number of data path elements (DPEs) interconnected by a global bus. Each DPE has a single output port that drives a portion of the global bus and a set of input ports that can read the output of any other DPE. If there are n DPEs and the width of their output ports is m bits, then the global bus width is n × m bits.

A simplified block diagram for a single DPE is shown in Figure 5. The global bus enters the DPE at the top of the diagram and feeds four m:1 multiplexers (labeled MUXA through MUXD). Each multiplexer has n × m inputs and m outputs. Two of the multiplexers are connected to data storage registers (DREG1 and DREG2) that can hold data values for later use. Two more multiplexers (SEL) select two of the data inputs for the multiplier/accumulator (MAC). The other two MAC inputs come from the logic units (LOGIC), which can perform simple Boolean operations on their inputs. The MAC computes the product of two of its inputs and adds that result to the other two inputs. The output register may be loaded with the MAC output (unshifted, shifted right, or shifted left) or may hold its current value. The data path elements are designed to complete all of these operations in a single clock period.
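The single-cycle behavior of a DPE described above can be approximated by a very small behavioral model. The sketch below is a toy Python model, not the VHDL model mentioned in the introduction. The class and method names are hypothetical, the shift is assumed to be by one bit (the text does not give the shift amount), the two LOGIC-unit outputs are passed in directly as integers, and word widths are not modeled because Python integers are unbounded.

```python
class DataPathElement:
    """Toy model of one DPE: select two bus words, MAC, update output."""

    def __init__(self):
        # Output register; it drives this DPE's slice of the global bus
        self.out = 0

    def step(self, bus, sel_a, sel_b, c=0, d=0, mode="load"):
        """Simulate one clock period.

        bus    : list of words currently on the global bus (one per DPE)
        sel_a,
        sel_b  : indices chosen by the two SEL multiplexers
        c, d   : the other two MAC inputs (LOGIC outputs, assumed given)
        mode   : "load", "shift_right", "shift_left" or "hold"
        """
        # MAC: product of two inputs plus the other two inputs
        result = bus[sel_a] * bus[sel_b] + c + d

        if mode == "load":
            self.out = result
        elif mode == "shift_right":
            self.out = result >> 1  # one-bit shift is an assumption
        elif mode == "shift_left":
            self.out = result << 1  # one-bit shift is an assumption
        elif mode != "hold":
            raise ValueError(f"unknown mode: {mode}")
        return self.out


def global_bus_width(n, m):
    """Global bus width in bits for n DPEs with m-bit output ports."""
    return n * m
```

With this model, a two-term sum of products such as the real part of a complex multiply takes two steps: the first loads one product, and the second adds the other product to the fed-back output value.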
Figure 4: Radix-4 Butterfly Directed Graph

Inputs (triangles): x(0), y(0), x(2), y(2), x(1), y(1), x(3), y(3), sin(a), cos(a), sin(b), cos(b), sin(c), cos(c)

Intermediate computations (ellipses): r0 = x(0); r0 += x(2); r0 += x(1); r1 = r0-x(2); r1 -= y(1); r1 -= y(3); r2 = r0-x(1); r2 -= x(3); r3 = r1+y(1); r3 -= y(3); s0 = y(0); s0 += y(2); s0 += y(1); s1 = s0-y(2); s1 += x(1); s1 -= x(3); s2 = s0-y(1); s2 -= y(3); s3 = s1-x(1); s3 += x(3); r1 *= sin(c); r2 *= sin(b); r3 *= sin(a); s1 *= sin(c); s2 *= sin(b); s3 *= sin(a)

Outputs (rectangles): X(0)=r0+x(3); X(1)=r3*cos(a) + s3*sin(a); X(2)=r2*cos(b) + s2*sin(b); X(3)=r1*cos(a) + s1*sin(a); Y(0)=s0+y(3); Y(1)=s3*cos(a) - r3*sin(a); Y(2)=s2*cos(b) - r2*sin(b); Y(3)=s1*cos(b) - r1*sin(b)
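The radix-4 butterfly of Figures 3 and 4 can be sketched compactly with Python complex numbers, which also makes the operation counts quoted earlier easy to check. This is an illustrative sketch under stated assumptions, not a transcription of the FORTRAN program: the function names are hypothetical, the outputs are returned in natural order H(0) to H(3), the twiddle factors `w1`, `w2`, `w3` are assumed to scale outputs 1, 2 and 3 respectively, and the negative-exponent convention is assumed.

```python
import cmath


def radix4_butterfly(h0, h1, h2, h3, w1=1, w2=1, w3=1):
    """Four-point DFT followed by twiddle scaling of outputs 1 to 3."""
    # First level: 4 complex add/sub (8 real operations)
    t0 = h0 + h2
    t1 = h0 - h2
    t2 = h1 + h3
    # Multiplying by -j only swaps real and imaginary parts with a sign
    # change, so in hardware it would need no real multiplier
    t3 = -1j * (h1 - h3)

    # Second level: 4 complex add/sub (8 real operations), then
    # 3 complex multiplies (12 real multiplies, 6 real add/sub)
    return (t0 + t2, w1 * (t1 + t3), w2 * (t0 - t2), w3 * (t1 - t3))


def dft4_reference(h):
    """Direct 4-point DFT, used only to check the butterfly."""
    return [
        sum(h[k] * cmath.exp(-2j * cmath.pi * k * n / 4) for k in range(4))
        for n in range(4)
    ]


# Real-operation tallies for one radix-4 butterfly versus the four
# radix-2 butterflies (4 multiplies and 6 add/sub each) it replaces
RADIX4_MULTS = 3 * 4
RADIX4_ADDS = 8 * 2 + 3 * 2
FOUR_RADIX2_MULTS = 4 * 4
FOUR_RADIX2_ADDS = 4 * 6
```

The tallies give 12 multiplies against 16 and 22 additions or subtractions against 24, which corresponds to the 25% and 8.3% reductions stated in Section 2.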