IEICE TRANS. ELECTRON., VOL.E85–C, NO.3 MARCH 2002
PAPER  Special Issue on Signals, Systems and Electronics Technology

A Custom VLSI Architecture for the Solution of FDTD Equations

Pisana PLACIDI†a), Leonardo VERDUCCI†, Guido MATRELLA††, Luca ROSELLI†, and Paolo CIAMPOLINI††, Nonmembers
SUMMARY  In this paper, the characteristics of a digital system dedicated to the fast execution of the FDTD algorithm, widely used for electromagnetic simulation, are presented. Such a system is conceived as a module communicating with a host personal computer via a PCI bus, and is based on a VLSI ASIC which implements the "field-update" engine. The system structure is defined by means of a hardware description language, allowing the high-level system specification to be kept independent of the actual fabrication technology. A virtual implementation of the system has been carried out by mapping such a description, in a standard-cell style, onto a commercial 0.35 µm technology. Simulations show that a significant speed-up can be achieved with respect to state-of-the-art software implementations of the same algorithm.
key words:
1. Introduction
In recent years, the FDTD method [1] has become one of the most widely used computational techniques for the full-wave analysis of electromagnetic phenomena. A number of inherent characteristics makes such an algorithm almost ideal for the analysis of a wide class of microwave and high-frequency circuits [2]–[4]; its application to practical, three-dimensional problems, however, is often limited by the demand for very large computational resources. In its basic formulation, in fact, the discretization procedure exploits an orthogonal lattice featuring $N = N_x \times N_y \times N_z$ mesh cells, onto the edges of which the vector components of the electric and magnetic fields are mapped. The total memory required to store the fields and the update coefficients thus increases as $O(N)$, while, at first order, one can roughly assume the total number of time iterations to increase as $O(N^{1/3})$ [2]. The number of floating-point operations of the FDTD algorithm thus increases as $O(N^{4/3})$. With respect to different discretization schemes, such a figure is a definitely favorable one; nevertheless, $N$ still comes from a fully 3D volume discretization, so that the computational requirements quickly increase with the domain complexity.
Manuscript received August 3, 2001. Manuscript revised October 30, 2001.
† The authors are with the Dipartimento di Ingegneria Elettronica e dell'Informazione, University of Perugia, Italy.
†† The authors are with the Dipartimento di Ingegneria dell'Informazione, University of Parma, Italy.
a) E-mail: placidi@diei.unipg.it
Thanks to microelectronic technologies, inexpensive RAMs and high-speed processors have become available, allowing practical engineering applications of the FDTD method to be faced. Recent microprocessors exploit RISC (Reduced Instruction Set Computer) architectures. By using short, fixed-length instructions and relatively simple addressing modes, pipelined and superscalar architectures can effectively be implemented, allowing large data throughputs to be achieved. Deep-submicron technologies allow fast and powerful floating-point units to be embedded within the processor, thus resulting in large computational speed. However, the actual performance which can be achieved by means of high-level programming is very much dependent on the operating system, the language compiler and the program structure [5].

On the other hand, the FDTD algorithm, taking advantage of the simplicity and symmetry of the discretized Maxwell's equations, exhibits some features which make a hardware implementation appealing. The algorithm core, taking care of the time-varying computation of the components of the electric and magnetic fields, consists of six triple-nested loops. Field-update loops are independent of each other, and can be formulated without any branch, so that parallelization and pipelining techniques can be straightforwardly exploited to obtain better performance. Moreover, due to the intrinsic nature of a finite-difference algorithm, updating a field component at a given location only involves a limited number of variables, located at nearest-neighboring positions. By using suitable numbering techniques, simple and regular coefficient patterns can be obtained, so that the data-communication bandwidth can be kept under control.

Vector and massively-parallel computers can thus be exploited to achieve fast implementations of the FDTD algorithm [5]. The basic drawback of such an approach consists of the huge hardware costs, which limit the development and diffusion of supercomputing facilities.
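The branch-free, triple-nested structure of one such field-update loop can be sketched as follows (here for $H_x$, in the form of Eq. (2a) given in Sect. 2). The flat array layout and the names `hx`, `ey`, `ez`, `c1`, `c2` are illustrative assumptions; a real implementation would fold per-cell material coefficients into arrays as well.

```c
/* Index into a flattened Nx x Ny x Nz field array (row-major). */
#define IDX(i, j, k, ny, nz) (((i) * (ny) + (j)) * (nz) + (k))

/* One of the six field-update loops: branch-free body, each iteration
 * touching only nearest-neighbor values, hence amenable to pipelining. */
void update_hx(float *hx, const float *ey, const float *ez,
               float c1, float c2, int nx, int ny, int nz)
{
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny - 1; j++)
            for (int k = 0; k < nz - 1; k++) {
                int c = IDX(i, j, k, ny, nz);
                /* Hx += c1*(Ez(j+1) - Ez(j)) + c2*(Ey(k) - Ey(k+1)) */
                hx[c] += c1 * (ez[IDX(i, j + 1, k, ny, nz)] - ez[c])
                       + c2 * (ey[c] - ey[IDX(i, j, k + 1, ny, nz)]);
            }
}
```

Note that no iteration depends on another iteration's result within the same time step, which is exactly the property the custom hardware exploits.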
The same characteristics recalled above, however, can be exploited to implement an efficient (yet scalar) custom processor, at a fraction of the cost of a general-purpose supercomputer.

In this paper, the architecture of a digital system is described, dedicated to the solution of the FDTD algorithm on a 3D domain. The system architecture is
built around a custom VLSI chip, which takes care of the floating-point operations needed to carry out the field updates: a superscalar, pipelined FPU has been designed for this purpose. The system is conceived as a PCB module, communicating with a host personal computer via a PCI bus. To speed up operations, communications with the host computer are kept at a minimum, and dedicated synchronous DRAM banks are exploited for the storage of variables and coefficients: thanks to the highly regular data pattern coming from the structured discretization grid, efficient memory-scan protocols can be exploited.

A virtual implementation of the system has been carried out by specifying the whole architecture through the VHDL hardware description language: in particular, the physical implementation of the custom FPU has been carried out by mapping the network onto a commercial technology. By adopting a standard-cell design style, a reliable timing characterization of the circuit can be obtained by means of simulation, and the overall performance of the system can thus be estimated. Expectations are that a significant speed-up, with respect to state-of-the-art software implementations of the FDTD algorithm, can be achieved.

This paper is organized as follows: in Sect. 2 below, the basics of the FDTD algorithm, applied to the solution of Maxwell's equations, are reviewed; Sect. 3 illustrates the system architecture, and details the structure of the custom floating-point unit. Some results of the system simulation are shown in Sect. 4, aiming at an estimate of the actual computational performance. Conclusions are eventually drawn in Sect. 5.
2. The FDTD Algorithm
Maxwell's equations in an isotropic medium can be written as follows:

$$\frac{\partial \mathbf{B}}{\partial t} + \nabla \times \mathbf{E} = 0, \qquad \frac{\partial \mathbf{D}}{\partial t} - \nabla \times \mathbf{H} = -\mathbf{J}, \qquad \mathbf{B} = \mu\mathbf{H}, \qquad \mathbf{D} = \varepsilon\mathbf{E} \qquad (1)$$

where the symbols have their usual meaning, and $\mathbf{J}$, $\mu$, and $\varepsilon$ are assumed to be given functions of space and time. According to the FDTD discretization method, the propagation domain is divided into cells, each cell being independently characterized in terms of material properties. Electromagnetic field components are mapped at cell edges as shown in Fig. 1, and Eqs. (1) are discretized, in both time and space, at each cell, using the (central) finite-difference method. Referring to a given magnetic-field component (for instance $H_x$), the update equation reads:
Fig. 1  Mapping of field components on Yee's mesh.
$$H_x\big|^{n+\frac{1}{2}}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}} = H_x\big|^{n-\frac{1}{2}}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}} + c_1\left(E_z\big|^{n}_{i,\,j+1,\,k+\frac{1}{2}} - E_z\big|^{n}_{i,\,j,\,k+\frac{1}{2}}\right) + c_2\left(E_y\big|^{n}_{i,\,j+\frac{1}{2},\,k} - E_y\big|^{n}_{i,\,j+\frac{1}{2},\,k+1}\right) \qquad \text{(2a)}$$

whereas the update equation for an electric-field component (for instance $E_x$) becomes:
$$E_x\big|^{n+1}_{i+\frac{1}{2},\,j,\,k} = E_x\big|^{n}_{i+\frac{1}{2},\,j,\,k} + c_3\left(H_z\big|^{n+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k} - H_z\big|^{n+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k}\right) + c_4\left(H_y\big|^{n+\frac{1}{2}}_{i+\frac{1}{2},\,j,\,k-\frac{1}{2}} - H_y\big|^{n+\frac{1}{2}}_{i+\frac{1}{2},\,j,\,k+\frac{1}{2}}\right) \qquad \text{(2b)}$$

Here, the standard Yee notation has been adopted: the superscripts refer to the time iteration, whereas the triplets at the subscript indicate the spatial location of the field components. In Eqs. (2a) and (2b), the coefficients $c_1$, $c_2$, $c_3$ and $c_4$ are kept constant throughout the simulation: their value depends only on the actual material, on the discretization mesh size and on the time step adopted [1].

The same discretization procedure is applied to all components at each cell, eventually resulting in a large set of linear algebraic equations. However, thanks to the so-called "leapfrog" scheme, a computationally effective way to face the solution of such an algebraic system can be found, which makes the inversion of huge, sparse matrices unnecessary. Basically, such a scheme involves the evaluation of the time derivatives of the electric- and magnetic-field components at alternate time intervals (denoted by integer and fractional superscripts, respectively, in Eqs. (2) above): according to such a scheme, field updates depend only on quantities computed at previous time iterations, so that the update equations pertaining to a given time step can be decoupled and independently solved (i.e., sequentially, in any arbitrary order). Six scalar equations per cell are eventually obtained, which share the same structure:
$$y = a + k_1(b - c) + k_2(d - e) + j_s \qquad (3)$$
Fig. 2  Overall system architecture.
In the above equation, $y$ represents the field component to be updated, $a$ is its current value (i.e., that computed at the previous time step), $k_1$ and $k_2$ are material-related constant coefficients, $b$, $c$, $d$ and $e$ are known field components (i.e., previously computed at neighboring grid points), and $j_s$ represents the field source, if any.
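A software model of this per-component update is a one-line function; the seven parallel inputs ($a$, $k_1$, $b$, $c$, $k_2$, $d$, $e$) correspond to the operands consumed by the custom FPU described in the next section, with $j_s$ as the additional source term. This is a sketch of the arithmetic, not the hardware itself.

```c
/* The generic per-component update of Eq. (3): every one of the six
 * scalar field-update equations per cell has exactly this shape. */
float field_update(float a, float k1, float b, float c,
                   float k2, float d, float e, float js)
{
    return a + k1 * (b - c) + k2 * (d - e) + js;
}
```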
3. System Architecture
The overall architecture is depicted in Fig. 2, and includes a dedicated floating-point unit (FPU), some register banks, a PCI Target Interface Circuit (PCI TARGET), a set of control units (CU) and several external SDRAM memories. Five 32-bit data buses have been considered, to allow for parallel data fetching from the SDRAM banks: actually, all data required to update the field components are obtained from the on-board SDRAM in just two read operations; if a general-purpose processor were used, eight operations would have been necessary instead. Ideally, implementing eight buses (plus eight sets of SDRAM control signals) would have allowed all of the data to be fetched in a single read cycle. However, this would have had a prohibitive impact on the I/O pin count, and on the chip and board signal routing. By limiting ourselves to five buses, a reasonable trade-off between the total throughput of the system and the number of chip I/O pins is obtained: due to the sequential data fetch, this solution also requires some additional registers for temporary data storage. No data buffering is instead needed for the output data, which are directly written back into the SDRAM banks.

The design has been driven by performance and scalability concerns. The architecture has been specified in the VHDL language [6]; by means of such a formalism, the system is described within a standard, industrial design flow, allowing simulation tools to be exploited to estimate performance and, in a further perspective,
Fig. 3  FPU architecture (field-source input not shown, i.e., $j_s = 0$).
to keep the design relatively independent of the actual fabrication technology.

Equation (3) actually involves only arithmetic addition and multiplication operations, so that the FPU datapath can be hierarchically designed, starting from a reduced set of modular operators. The designed FPU, illustrated in Fig. 3, is a fully pipelined unit, based on the IEEE 754 standard floating-point number representation, and includes adders and multipliers. Registers are used as delay units ($d$), to keep data aligned within the pipeline. The floating-point unit has 7 parallel 32-bit inputs and is capable of working out one updated field component per clock cycle. General-purpose microprocessors, instead, generally include a smaller number of FP operators, and hence require the update computation to be split over several clock cycles.

As stated above, the FPU has been mapped onto a deep-submicron commercial technology (ALCATEL MTC45000, 0.35 µm), available for multi-project wafer runs. Propagation delays throughout the datapath have thus been taken into account: in this first version, the design timing has been targeted to a 166 MHz operating frequency; this relatively loose requirement comes from the consideration that, as detailed below, the system throughput is actually limited by the communication with the SDRAM modules. Propagation times of 3.5 and 4 ns were obtained for the adder and the multiplier, respectively. The PCI TARGET module manages the data flow between the FDTD system and the host PC: it complies with the standard PCI specifications [7], so that high-level control routines, running on the host processor, can take care of the initial data loading, of the output data management and of exception handling. High-speed communication between the FPU and the RAM modules, instead, is managed by the on-chip Control Unit (CU). Such a unit is actually split into a set of parallel subunits, which take care of the data flow between the SDRAM, the FPU and the PCI interface: by issuing proper instructions to the memory and the datapath, the subsequent operations needed by the FDTD algorithm
(loading of the constant coefficients, initialization of the system variables, acquisition of the external field stimuli, the actual field-component update, write-back of the results to the host system) can be executed. In order to avoid stalls of the pipelined FPU, fresh data from the SDRAM should be accessed in a synchronous fashion (i.e., fetching a new set of consistent data per clock cycle). This goal can be achieved by exploiting the streaming-mode ("burst") access of SDRAMs. To this purpose, however, the contents of the SDRAM memories need to be properly organized, taking advantage of the regular structure of the discretization mesh. That is, by fixing 2 out of the 3 coordinates in space, a straight row of cells is selected: for each cell in the row, due to the translational symmetry of the FDTD mesh, the field coefficients appearing in homologous positions within Eq. (3) refer to aligned mesh cells as well.

Hence, it is convenient to store the field components relative to aligned cells into the same memory row, as illustrated in Fig. 4: the main update loop then scans aligned cells, so that the required coefficients come, in sequence, from adjacent RAM locations, as shown in Fig. 5. This, thanks also to the parallel-bus architecture, allows the optimization of RAM communication; taking advantage of burst access, fully synchronous operation can be achieved and no pipeline stall, due to
Fig. 4  Storage of the field components into SDRAM memories.

Fig. 5  Field update and memory update scheme.
data latency, may occur while scanning a row: just two read cycles are needed to access the input data, and one single cycle is required to store the result. Pipeline stalls may only occur when jumping from one row to the next: it is therefore convenient to scan the mesh along the longest edge (if any) of the device under inspection. In the test case at hand, we assumed to deal with 8-Mbyte SDRAM chips, which would allow problems featuring up to two million field components to be faced. This is actually quite a reasonable figure for a wide range of practical cases; nevertheless, much larger SDRAMs are being designed and produced, which could be exploited almost straightforwardly (only minor changes would be needed in the memory-control section, to manage a larger address space). In any case, the SDRAM physical size does not necessarily represent an upper limit for the mesh size: since, due to the discretization algorithm, the equations are decoupled, the simulation domain can easily be partitioned into separate segments. At the expense of some overhead and of an increased complexity in the control software (i.e., a routine would have to manage the swapping of the mesh segments), each segment could be independently considered by the FDTD engine.
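The row-oriented storage scheme of Figs. 4 and 5 can be sketched as an address-mapping function: field values of cells aligned along the scan direction land at consecutive SDRAM addresses, so the inner update loop sweeps one memory row in burst mode. The exact packing below (with $j$ as the fastest-varying coordinate) is an illustrative assumption, not the chip's actual address map.

```c
/* Map mesh coordinates (i, j, k) of one field-component array to a
 * linear SDRAM word address. Cells aligned along j (the scan
 * direction) occupy consecutive addresses, enabling burst reads. */
unsigned long sdram_address(int i, int j, int k, int ny, int nz)
{
    return ((unsigned long)i * nz + k) * (unsigned long)ny + j;
}
```

With this layout, stepping the update loop along $j$ advances the address by exactly one word, which is the access pattern SDRAM burst mode rewards; a jump in $i$ or $k$ (i.e., moving to a new row) is the only place a stall can occur, matching the behavior described above.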
4. Performance Estimate
To evaluate the performance of the proposed architecture, and to allow for a comparison with a software implementation on a general-purpose architecture, a fairly simple case has been considered: a cubic resonant cavity has been simulated, discretized by means of a 21 × 21 × 21 cell mesh. After a pulsed stimulus is injected at a given location (accounting for a non-zero initial field), the transient simulation is carried out: given such an elementary structure, it makes little sense, here, to look at the simulation results in terms of actual time