7th NASA Symposium on VLSI Design 1998

Transform Processing on a Reconfigurable Data Path Processor¹

K. Joe Hass, David F. Cox
jhass@mrc.unm.edu, dcox@mrc.unm.edu
NASA Institute of Advanced Microelectronics
Microelectronics Research Center
University of New Mexico
801 University Blvd. SE, Suite 206
Albuquerque, New Mexico 87106
Abstract – A Reconfigurable Data Path Processor is capable of high performance transform processing. The Fast Fourier Transform is used to investigate how the RDPP architecture influences its ability to execute these algorithms. Directed graphs are used to visualize the algorithm, which is then implemented in the unique programming language of the RDPP.
1 Introduction
Previous efforts resulted in the development of a reconfigurable data path processor, consisting of a number of identical data path elements interconnected with a wide crossbar bus. Each data path element is capable of performing simple logic functions as well as a multiply-accumulate (MAC) step, making the data path processor well suited to digital signal processing applications. This paper describes the implementation of transform algorithms, such as the Fast Fourier Transform (FFT), on the reconfigurable processor. Conventional software descriptions of these algorithms are converted to directed graphs in order to visualize and explore the flow of data. These graphs then suggest possible configurations of the reconfigurable processor. A behavioral VHDL model of the processor is used to evaluate architectural tradeoffs, while gate level models of a data path element were created to examine design issues at a lower level. These models have suggested alternate configurations of the processor that are better suited to transform algorithms. A comparison of these architectures is presented, along with estimates of processing performance for several algorithms of interest.
2 The Fast Fourier Transform
The Fourier Transform is used to convert some function of time, $h(t)$, to its equivalent representation, $H(f)$, in the frequency domain. In practice, the transform is typically performed on a set of data points, $h_n$, obtained by sampling $h(t)$ at equally spaced time intervals. The Discrete Fourier Transform, or DFT, is used in this case:

$$H_n \equiv \sum_{k=0}^{N-1} h_k \, e^{j 2 \pi k n / N}$$

where $N$ is the number of samples.

The DFT transforms $N$ complex numbers (the sampled data points) into $N$ complex numbers that represent the same data in the frequency domain. Each $H_n$ computation requires the summation of $N$ product terms, so the complete transform requires $O(N^2)$ operations. For large sample sizes this brute-force calculation is not practical.

¹ This research was supported by NASA under Space Engineering Research Grant NAGW-3293.

More efficient methods for calculating the DFT have been investigated for nearly 200 years, starting with Gauss in 1805. In 1942, Danielson and Lanczos showed that a DFT of length $N$ can be performed by adding the results of two DFTs of length $N/2$, where one of the DFTs uses the even numbered elements from the original data set and the other DFT uses the odd numbered elements. This can be repeated, dividing the original data set into smaller and smaller pieces until the DFT of only two samples is performed. This concept was popularized in the context of electronic computers in 1965 by J. W. Cooley and J. W. Tukey and is commonly known as the Cooley-Tukey Fast Fourier Transform.

The basic building block of the FFT is the butterfly operator, shown in Figure 1. It performs these two computations:
$$H(0) = h(0) + h(1) \tag{1}$$

$$H(1) = W^k \, (h(0) - h(1)) \tag{2}$$

Since all of the input and output values are complex numbers, this butterfly actually requires ten real operations: three additions, three subtractions and four multiplications. The $W^k$ 'twiddle factors' are themselves complex numbers of the form

$$W^k = e^{-j 2 \pi k / N} = \cos(2 \pi k / N) - j \sin(2 \pi k / N)$$

The butterfly is usually drawn without explicitly showing the addition or subtraction, and it is understood that whenever two or more arrows meet some addition or subtraction is implied.
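The count of ten real operations can be checked by writing equations (1) and (2) out in real arithmetic. The following Python sketch is only an illustration; the function names `twiddle` and `radix2_butterfly` are hypothetical and do not come from the RDPP tools.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor W^k = exp(-j*2*pi*k/N) = cos(2*pi*k/N) - j*sin(2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def radix2_butterfly(h0, h1, w):
    """Radix-2 butterfly of equations (1) and (2), using real arithmetic only."""
    x0, y0 = h0.real, h0.imag
    x1, y1 = h1.real, h1.imag
    c, s = w.real, w.imag

    # H(0) = h(0) + h(1): two real additions
    sum_re = x0 + x1
    sum_im = y0 + y1

    # h(0) - h(1): two real subtractions
    diff_re = x0 - x1
    diff_im = y0 - y1

    # W^k * (h(0) - h(1)): four real multiplications,
    # plus one more subtraction and one more addition
    out_re = diff_re * c - diff_im * s
    out_im = diff_re * s + diff_im * c

    return complex(sum_re, sum_im), complex(out_re, out_im)
```

For example, `radix2_butterfly(1 + 2j, 3 - 1j, twiddle(1, 8))` returns the pair $(H(0), H(1))$ for $k = 1$ and $N = 8$. The comments tally three additions, three subtractions and four multiplications.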
Figure 1: Radix-2 FFT Butterfly

The butterfly operator can be used to compute the FFT of any set of data points where $N$ is of the form $N = 2^m$. Figure 2 illustrates this process for $N = 8$. In the first stage, the butterfly operates on data points $h(0)$ and $h(4)$, then on points $h(1)$ and $h(5)$, and so on until the data is exhausted. The second stage processes the results from the first stage, now working with data points that are only two locations apart. Finally, adjacent data elements are combined in the third stage to produce the transform outputs, $H(n)$.
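The staged process described above can be sketched in Python as follows. This is a minimal illustration, not the RDPP implementation: the names `fft_dif` and `dft_direct` are hypothetical, and the final bit-reversal reordering of the outputs is an assumption about how the results are indexed. The sketch uses the negative-exponent convention of the twiddle factor definition, $W^k = e^{-j 2 \pi k / N}$; under the positive exponent written in the DFT definition, the twiddle factors would be conjugated.

```python
import cmath


def fft_dif(h):
    """Radix-2 FFT built from the butterfly of equations (1) and (2).

    The first stage pairs points N/2 apart, the next N/4 apart, and the
    last stage pairs adjacent points.
    """
    n = len(h)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in h]

    span = n // 2
    while span >= 1:
        stride = n // (2 * span)  # twiddle exponent step for this stage
        for start in range(0, n, 2 * span):
            for i in range(span):
                w = cmath.exp(-2j * cmath.pi * (i * stride) / n)
                a = data[start + i]
                b = data[start + i + span]
                data[start + i] = a + b               # equation (1)
                data[start + i + span] = w * (a - b)  # equation (2)
        span //= 2

    # Results emerge in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, value in enumerate(data):
        rev = int(format(i, "b").zfill(bits)[::-1], 2) if bits else 0
        out[rev] = value
    return out


def dft_direct(h):
    """Brute-force O(N^2) DFT with the same sign convention, for comparison."""
    n = len(h)
    return [
        sum(h[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
```

For $N = 8$ the loop runs three stages with spans of 4, 2 and 1, and the twiddle exponents used are 0, 1, 2, 3 in the first stage, 0 and 2 in the second, and 0 in the third.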
Figure 2: FFT for 8 data points

The butterfly operator discussed above is called a radix-2 butterfly because it processes 2 data elements in parallel. Butterfly operators for higher radices can be derived, such as the radix-4 butterfly shown in Figure 3. The radix-4 butterfly replaces four radix-2 butterflies, reducing the number of multiplies by 25%. However, each transform result is based on a sum and difference term derived from all four inputs so the number of additions is only reduced by 8.3%. Of particular significance for hardware FFT processors is the fact that an FFT based on the radix-4 butterfly needs only $m = \log_4 N$ stages, which can reduce the processor clock rate and data memory bandwidth requirements. Radix-8 and radix-16 butterflies have also been used, although they are more complicated and require a multiplication inside the butterfly as well as the twiddle factor scaling at the outputs.
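The 25% and 8.3% figures can be reproduced with one plausible accounting, sketched below in Python. The butterfly form shown (sums and differences first, twiddle scaling at the outputs, negative-exponent twiddles) is an assumption, as are the names `mul_neg_j` and `radix4_butterfly`; the paper's Figure 3 may order the outputs differently.

```python
import cmath


def mul_neg_j(z):
    """Multiply by -j without a multiplier: (x + jy)(-j) = y - jx."""
    return complex(z.imag, -z.real)


def radix4_butterfly(a, b, c, d, k, n):
    """Radix-4 butterfly (assumed form) with twiddle scaling at the outputs.

    Eight complex additions/subtractions (16 real), then three complex
    twiddle multiplications (12 real multiplies and 6 real additions).
    """
    t0, t1 = a + c, a - c
    t2, t3 = b + d, mul_neg_j(b - d)
    w1 = cmath.exp(-2j * cmath.pi * k / n)
    w2 = cmath.exp(-2j * cmath.pi * 2 * k / n)
    w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)
    return (t0 + t2, (t1 + t3) * w1, (t0 - t2) * w2, (t1 - t3) * w3)


# Real-operation tallies. Each radix-2 butterfly costs 4 multiplies and
# 6 additions/subtractions, as stated for equations (1) and (2).
RADIX2_MULTS = 4 * 4          # four radix-2 butterflies
RADIX2_ADDS = 4 * 6
RADIX4_MULTS = 3 * 4          # three complex twiddle multiplies
RADIX4_ADDS = 8 * 2 + 3 * 2   # eight complex add/sub plus multiply overhead

mult_saving = 1 - RADIX4_MULTS / RADIX2_MULTS   # fraction of multiplies saved
add_saving = 1 - RADIX4_ADDS / RADIX2_ADDS      # fraction of additions saved
```

Under this accounting the multiplies drop from 16 to 12 and the additions from 24 to 22, which matches the quoted reductions.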
Figure 3: Radix-4 FFT Butterfly
Butterflies with different radices can be combined for any given value of $N$. For example, a 1024-point FFT can be performed using two passes with a radix-16 butterfly followed by a single pass with a radix-4 butterfly because $1024 = 16 \times 16 \times 4$. If the data rate for this FFT was $1 \times 10^6$ samples per second, then the bandwidth for the data into the FFT processor must be three times as high, or $3 \times 10^6$ complex data words per second, in order to process data sets as fast as they arrive. A processor that had only a radix-2 butterfly would need ten passes, with a data bandwidth of $10 \times 10^6$ complex words per second, to process this data in real time.
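The bandwidth arithmetic in this example amounts to one pass over the data per butterfly stage. A small Python sketch, with the hypothetical helper name `fft_input_bandwidth`, makes the relationship explicit:

```python
from math import prod


def fft_input_bandwidth(n_points, radices, sample_rate):
    """Complex words per second needed to keep up with the input stream.

    `radices` lists the butterfly radix used on each pass; their product
    must equal the transform length. Every pass reads the whole data set
    once, so the required bandwidth is the number of passes times the
    incoming sample rate.
    """
    if prod(radices) != n_points:
        raise ValueError("product of radices must equal the FFT length")
    return len(radices) * sample_rate


mixed = fft_input_bandwidth(1024, [16, 16, 4], 1e6)    # three passes
radix2_only = fft_input_bandwidth(1024, [2] * 10, 1e6)  # ten passes
```

Here `mixed` evaluates to $3 \times 10^6$ and `radix2_only` to $10 \times 10^6$ complex words per second, the two cases discussed above.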
3 The FFT in a Programmable Parallel Processor
The butterfly operator represents a much different computational problem than typical FIR and IIR digital filters. The FFT algorithm is not a simple recursive multiply-and-accumulate process, it processes data in relatively small blocks, and the fact that all data values are complex numbers doesn't make matters any easier. However, there is a great deal of parallelism that can be exploited if appropriate hardware resources are available. This can be seen in the directed graph of the radix-4 algorithm shown in Figure 4, which was adapted from a FORTRAN program. The graph can be used to visualize how this algorithm could be mapped into a highly parallel hardware implementation.

The column of triangles in the center of the graph represents the data inputs. Input values $x(0)$ through $x(3)$ are the real portion of the data points while $y(0)$ through $y(3)$ are the imaginary portion. In other words, $h(n) = x(n) + j\,y(n)$. Similarly, $\cos(a)$ through $\cos(c)$ are the real portion of the complex twiddle factors while $\sin(a)$ through $\sin(c)$ are the imaginary portion. Ellipses represent an intermediate computation and rectangles represent a computation that produces an output value. Processing proceeds from the top of the graph to the bottom, with each row corresponding to a sequential time step in the algorithm.

A reconfigurable data path architecture that may be well suited to FFT algorithms has been described. This architecture was later modified slightly to improve the utilization of its data path elements. Basically, the data path consists of a number of data path elements (DPE) interconnected by a global bus. Each DPE has a single output port that drives a portion of the global bus and a set of input ports that can read the output of any other DPE. If there are $n$ DPEs and the width of their output ports is $m$ bits, then the global bus width is $n \times m$ bits.

A simplified block diagram for a single DPE is shown in Figure 5. The global bus enters the DPE at the top of the diagram and feeds four $m$:1 multiplexers (labeled MUXA through MUXD). Each multiplexer has $n \times m$ inputs and $m$ outputs. Two of the multiplexers are connected to data storage registers (DREG1 and DREG2) that can hold data values for later use. Two more multiplexers (SEL) select two of the data inputs for the multiplier/accumulator (MAC). The other two MAC inputs come from the logic units (LOGIC), which can perform simple Boolean operations on their inputs. The MAC computes the product of two of its inputs and adds that result to the other two inputs. The output register may be loaded with the MAC output (unshifted, shifted right, or shifted left) or may hold its current value. The data path elements are designed to complete all of these operations in a single clock period.
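The MAC and output register behavior described above can be summarized in a short behavioral sketch. This Python model is not the VHDL model used in the paper. The function and mode names are hypothetical, a one-bit shift is assumed because the shift distance is not stated, and plain unbounded integers stand in for the $m$-bit data words.

```python
def global_bus_width(n_dpes, port_bits):
    """Global bus width: n DPEs, each driving an m-bit output port."""
    return n_dpes * port_bits


def mac(p, q, r, s):
    """Product of two inputs, added to the other two inputs."""
    return p * q + r + s


def next_output(out_reg, mac_out, mode):
    """Value of the output register after one clock period.

    Mode names and the one-bit shift distance are assumptions made
    for illustration only.
    """
    if mode == "hold":
        return out_reg
    if mode == "load":
        return mac_out
    if mode == "shift_right":
        return mac_out >> 1
    if mode == "shift_left":
        return mac_out << 1
    raise ValueError(f"unknown output mode: {mode}")
```

For instance, with MAC inputs 3, 4, 5 and 6 the MAC output is 23; loading it unshifted leaves 23 in the output register, while `"hold"` leaves the register unchanged.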
Figure 4: Radix-4 Butterfly Directed Graph (inputs x(0) to x(3), y(0) to y(3), and the sines and cosines of a, b and c; intermediate terms r0 to r3 and s0 to s3; outputs X(0) to X(3) and Y(0) to Y(3))
