IEICE TRANS. ELECTRON., VOL.E85–C, NO.3 MARCH 2002
PAPER  Special Issue on Signals, Systems and Electronics Technology

A Custom VLSI Architecture for the Solution of FDTD Equations

Pisana PLACIDI†a), Leonardo VERDUCCI†, Guido MATRELLA††, Luca ROSELLI†, and Paolo CIAMPOLINI††, Nonmembers

SUMMARY  In this paper, the characteristics of a digital system dedicated to the fast execution of the FDTD algorithm, widely used for electromagnetic simulation, are presented. The system is conceived as a module communicating with a host personal computer via a PCI bus, and is based on a VLSI ASIC which implements the "field-update" engine. The system structure is defined by means of a hardware description language, which keeps the high-level system specification independent of the actual fabrication technology.
A virtual implementation of the system has been carried out by mapping such a description, in a standard-cell style, onto a commercial 0.35 µm technology. Simulations show that a significant speed-up can be achieved with respect to state-of-the-art software implementations of the same algorithm.
key words:

1. Introduction

In recent years, the FDTD method [1] has become one of the most widely used computational techniques for the full-wave analysis of electromagnetic phenomena. A number of inherent characteristics make the algorithm almost ideal for the analysis of a wide class of microwave and high-frequency circuits [2]–[4]; its application to practical, three-dimensional problems, however, is often limited by the demand for very large computational resources. In its basic formulation, in fact, the discretization procedure exploits an orthogonal lattice featuring N = Nx × Ny × Nz mesh cells, onto the edges of which the vector components of the electric and magnetic fields are mapped. The total memory required to store the fields and the update coefficients thus increases as O(N), while, to first order, one can roughly assume the total number of time iterations to increase as O(N^(1/3)) [2]. The number of floating-point operations of the FDTD algorithm thus increases as O(N^(4/3)). With respect to different discretization schemes, such a figure is a definitely favorable one; nevertheless, N still comes from a fully 3D volume discretization, so that the computational requirements quickly increase with the domain complexity.

Manuscript received August 3, 2001.
Manuscript revised October 30, 2001.
†The authors are with the Dipartimento di Ingegneria Elettronica e dell'Informazione, University of Perugia, Italy.
††The authors are with the Dipartimento di Ingegneria dell'Informazione, University of Parma, Italy.
a) E-mail: placidi@diei.unipg.it

Thanks to microelectronic technologies, inexpensive RAMs and high-speed processors have become available, allowing practical engineering applications of the FDTD method to be faced. Recent microprocessors exploit RISC (Reduced Instruction Set Computer) architectures. By using short, fixed-length instructions and relatively simple addressing modes, pipelined and superscalar architectures can effectively be implemented, allowing large data throughputs to be achieved. Deep submicron technologies allow fast and powerful floating-point units to be embedded within the processor, thus resulting in large computational speed. However, the actual performance which can be achieved by means of high-level programming depends very much on the operating system, the language compiler and the program structure [5].

On the other hand, the FDTD algorithm, taking advantage of the simplicity and symmetry of the discretized Maxwell's equations, exhibits some features which make a hardware implementation appealing. The algorithm core, taking care of the time-varying computation of the components of the electric and magnetic fields, consists of six triple-nested loops. The field-update loops are independent of each other, and can be formulated without any branch, so that parallelization and pipelining techniques can be straightforwardly exploited to obtain better performance.

Moreover, due to the intrinsic nature of a finite-difference algorithm, updating a field component at a given location only involves a limited number of variables, located at nearest-neighboring positions. By using suitable numbering techniques, simple and regular coefficient patterns can be obtained, so that the data-communication bandwidth can be kept under control.

Vector and massively-parallel computers can thus be exploited to achieve fast implementations of the FDTD algorithm [5].
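The branch-free, triple-nested loop structure described above can be made concrete with a minimal sketch (plain Python; the array and coefficient names are illustrative, and the material coefficients c1 and c2 are taken as uniform for brevity):

```python
# Minimal sketch of one of the six FDTD field-update loops (the Hx
# update), illustrating the two hardware-friendly properties noted
# above: the loop body contains no branches, and each (i, j, k)
# update is independent of all the others.
def update_hx(Hx, Ey, Ez, c1, c2, Nx, Ny, Nz):
    for i in range(Nx):
        for j in range(Ny - 1):
            for k in range(Nz - 1):
                Hx[i][j][k] += (c1 * (Ez[i][j + 1][k] - Ez[i][j][k]) +
                                c2 * (Ey[i][j][k] - Ey[i][j][k + 1]))
    return Hx
```

Since no iteration depends on another within the same half time step, the loop nest can be pipelined or distributed freely, which is precisely the property a custom datapath can exploit.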
The basic drawback of such a supercomputing approach consists of the huge hardware costs, which limit the development and diffusion of supercomputing facilities. The same characteristics recalled above, however, can be exploited to implement an efficient (yet scalar) custom processor, at a fraction of the cost of a general-purpose supercomputer.

In this paper, the architecture of a digital system dedicated to the solution of the FDTD algorithm on a 3D domain is described. The system architecture is built around a custom VLSI chip, which takes care of the floating-point operations needed to carry out the field updates: a superscalar, pipelined FPU has been designed for this purpose. The system is conceived as a PCB module, communicating with a host personal computer via a PCI bus. To speed up operations, communications with the host computer are kept to a minimum, and dedicated synchronous DRAM banks are exploited for the storage of variables and coefficients: thanks to the highly regular data pattern coming from the structured discretization grid, efficient memory-scan protocols can be exploited.

A virtual implementation of the system has been carried out by specifying the whole architecture through the VHDL hardware description language: in particular, the physical implementation of the custom FPU has been carried out by mapping the network onto a commercial technology. By adopting a standard-cell design style, a reliable timing characterization of the circuit can be obtained by means of simulation, and the overall performance of the system can thus be estimated.
Expectations are that a significant speed-up, with respect to state-of-the-art software implementations of the FDTD algorithm, can be achieved.

This paper is organized as follows: in Sect. 2 below, the basics of the FDTD algorithm, applied to the solution of Maxwell's equations, are reviewed; Sect. 3 illustrates the system architecture, and details the structure of the custom floating-point unit. Some results of the system simulation are shown in Sect. 4, aiming at an estimate of the actual computational performance. Conclusions are eventually drawn in Sect. 5.

2. The FDTD Algorithm

Maxwell's equations in an isotropic medium can be written as follows:

  ∂B/∂t + ∇ × E = 0
  ∂D/∂t − ∇ × H = −J
  B = µH
  D = εE                                          (1)

where the symbols have their usual meaning, and J, µ, and ε are assumed to be given functions of space and time. According to the FDTD discretization method, the propagation domain is divided into cells, each cell being independently characterized in terms of material properties. Electromagnetic field components are mapped at cell edges as shown in Fig. 1, and Eqs. (1) are discretized, in both time and space, at each cell, using the (central) finite-difference method.

Fig. 1  Mapping of field components on Yee's mesh.

Referring to a given magnetic-field component (for instance Hx), the update equation reads:
  Hx|^{n+1/2}_{i, j+1/2, k+1/2} = Hx|^{n−1/2}_{i, j+1/2, k+1/2}
      + c1 ( Ez|^{n}_{i, j+1, k+1/2} − Ez|^{n}_{i, j, k+1/2} )
      + c2 ( Ey|^{n}_{i, j+1/2, k} − Ey|^{n}_{i, j+1/2, k+1} )        (2a)

whereas the update equation for an electric-field component (for instance Ex) becomes:

  Ex|^{n+1}_{i+1/2, j, k} = Ex|^{n}_{i+1/2, j, k}
      + c3 ( Hz|^{n+1/2}_{i+1/2, j+1/2, k} − Hz|^{n+1/2}_{i+1/2, j−1/2, k} )
      + c4 ( Hy|^{n+1/2}_{i+1/2, j, k−1/2} − Hy|^{n+1/2}_{i+1/2, j, k+1/2} )  (2b)

Here, the standard Yee notation has been adopted: the superscripts refer to the time iteration, whereas the triplets at the subscript indicate the spatial location of the field components.

In Eqs. (2a) and (2b), the coefficients c1, c2, c3 and c4 are kept constant throughout the simulation: their value depends only on the actual material, on the discretization mesh size and on the time step adopted [1].

The same discretization procedure is applied to all components at each cell, eventually resulting in a large set of algebraic linear equations. However, thanks to the so-called "leapfrog" scheme, a computationally effective way to solve such an algebraic system can be found, which makes the inversion of huge, sparse matrices unnecessary.

Basically, such a scheme involves the evaluation of the time derivatives of the electric and magnetic-field components at alternate time intervals (denoted by integer and fractional superscripts, respectively, in Eq. (2) above): according to such a scheme, field updates depend only on quantities computed at previous time iterations, so that the update equations pertaining to a given time step can be decoupled and independently solved (i.e., sequentially, in any arbitrary order). Six scalar equations per cell are eventually obtained, which share the same structure:

  y = a + k1 (b − c) + k2 (d − e) + js            (3)
Fig. 2  Overall system architecture.

In the above equation, y represents the field component to be updated, a is its current value (i.e., that computed at the previous time step), k1 and k2 are material-related constant coefficients, b, c, d and e are known field components (i.e., previously computed at neighboring grid points), and js represents the field source, if any.

3. System Architecture

The overall architecture is depicted in Fig. 2, and includes a dedicated floating-point unit (FPU), some register banks, a PCI Target Interface Circuit (PCI TARGET), a set of control units (CU) and several external SDRAM memories.

Five 32-bit data buses have been considered, to allow for parallel data fetching from the SDRAM banks: all of the data required to update a field component are obtained from the on-board SDRAM in just two read operations; if a general-purpose processor were used, eight operations would instead have been necessary. Ideally, implementing eight buses (plus eight sets of SDRAM control signals) would have allowed all of the data to be fetched in a single read cycle.
However, this would have prohibitively impacted the I/O pin count, and the chip and board signal routing. By limiting ourselves to five buses, a reasonable tradeoff between the total system throughput and the number of chip I/O pins is obtained: due to the sequential data fetch, this solution also requires some additional registers for temporary data storage. No data buffering is instead needed for output data, which are directly written back into the SDRAM banks.

The design has been driven by performance and scalability concerns. The architecture has been specified in the VHDL language [6]; by means of such a formalism, the system is described within a standard, industrial design flow, allowing simulation tools to be exploited to estimate performance and, in a further perspective, to keep the design relatively independent of the actual fabrication technology.

Fig. 3  FPU architecture (field source input not shown, i.e., js = 0).

Equation (3) actually involves only arithmetic addition and multiplication operations, so that the FPU datapath can be hierarchically designed, starting from a reduced set of modular operators. The designed FPU, illustrated in Fig. 3, is a fully pipelined unit, based on the IEEE-754 standard floating-point number representation, and includes adders and multipliers. Registers are used as delay units (d), to keep data aligned within the pipeline. The floating-point unit has 7 parallel 32-bit inputs and is capable of working out one updated field component per clock cycle. General-purpose microprocessors, instead, generally include a smaller number of FP operators, and hence require the update computation to be split over several clock cycles.

As stated above, the FPU has been mapped onto a deep-submicron commercial technology (ALCATEL MTC45000, 0.35 µm), available for multi-project wafer runs.
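The behavior of such a pipelined datapath can be sketched at a purely functional level as follows (a Python model; the three-stage partition into differences, products and final sums is an illustrative assumption based on Fig. 3, not the actual VHDL):

```python
# Behavioral sketch of a pipelined FPU for Eq. (3): stage 1 computes
# the two differences, stage 2 the two products, stage 3 the final
# sums. The registers s1 and s2 play the role of the delay units (d)
# of Fig. 3, keeping the operand a aligned with the partial results.
def fpu_pipeline(inputs):
    """inputs: iterable of (a, b, c, d, e, k1, k2) operand tuples."""
    s1 = s2 = None
    results = []
    for op in list(inputs) + [None, None]:   # two extra cycles to flush
        if s2 is not None:                   # stage 3: y = a + p1 + p2
            a, p1, p2 = s2
            results.append(a + p1 + p2)
        if s1 is not None:                   # stage 2: multiplications
            a, d1, d2, k1, k2 = s1
            s2 = (a, k1 * d1, k2 * d2)
        else:
            s2 = None
        if op is not None:                   # stage 1: subtractions
            a, b, c, d, e, k1, k2 = op
            s1 = (a, b - c, d - e, k1, k2)
        else:
            s1 = None
    return results
```

One result is produced per cycle once the pipeline is full, mirroring the one-field-component-per-clock throughput described above for the FPU.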
Propagation delays throughout the datapath have thus been taken into account: in this first version, the design timing has been targeted at a 166 MHz operating frequency; this relatively loose requirement comes from the consideration that, as detailed below, the system throughput is actually limited by the communication with the SDRAM modules. Propagation times of 3.5 and 4 ns were obtained for the adder and the multiplier, respectively. The PCI TARGET module manages the data flow between the FDTD system and the host PC: it complies with the standard PCI specifications [7], so that high-level control routines, running on the host processor, can take care of initial data loading, of output data management and of exception handling. High-speed communication between the FPU and the RAM modules, instead, is managed by the on-chip Control Unit (CU). Such a unit is actually split into a set of parallel sub-units, which take care of the data flow between the SDRAM, the FPU and the PCI interface: by issuing proper instructions to the memory and the datapath, the subsequent operations needed by the FDTD algorithm (loading of the constant coefficients, initialization of the system variables, acquisition of the external field stimuli, the actual field-component update, write-back of the results to the host system) can be executed. In order to avoid stalls of the pipelined FPU, fresh data from the SDRAM should be accessed in a synchronous fashion (i.e., fetching a new set of consistent data per clock cycle). This goal can be achieved by exploiting the streaming-mode ("burst") access of SDRAMs. To this purpose, however, the contents of the SDRAM memories need to be properly organized, taking advantage of the regular structure of the discretization mesh.
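A minimal sketch of what such an organization can look like (a hypothetical row-major linearization with the scan direction fastest-varying; the layout actually used by the system is the one illustrated in Fig. 4):

```python
# Sketch of a burst-friendly layout: field samples of cells that are
# aligned along the scan direction (here j) sit at consecutive
# addresses, so scanning a row of the mesh reads a contiguous block,
# which is exactly what SDRAM burst access rewards.
def address(i, j, k, Ny, Nz):
    """Map mesh cell (i, j, k) to a linear address, j fastest-varying."""
    return (i * Nz + k) * Ny + j

# A row of aligned cells (fixed i and k, j varying) maps to a
# contiguous address range:
row = [address(3, j, 2, Ny=8, Nz=4) for j in range(8)]
assert row == list(range(row[0], row[0] + 8))
```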
In practice, by fixing 2 out of the 3 spatial coordinates, a straight row of cells is selected: for each cell in the row, and due to the translational symmetry of the FDTD mesh, the field coefficients appearing in homologous positions within Eq. (3) refer to aligned mesh cells as well.

Hence, it is convenient to store field components relative to aligned cells into the same memory row, as illustrated in Fig. 4: the main update loop then scans aligned cells, so that the required coefficients come, in sequence, from adjacent RAM locations, as shown in Fig. 5. This, thanks also to the parallel-bus architecture, allows the optimization of RAM communication; taking advantage of burst access, fully synchronous operation can be achieved and no pipeline stall, due to data latency, may occur while scanning a row: just two read cycles are needed to access the input data, and one single cycle is required to store the result. Pipeline stalls may only occur when jumping from a row to the subsequent one: it is therefore convenient to scan the mesh along the longest edge (if any) of the device under inspection.

Fig. 4  Storage of the field components into SDRAM memories.
Fig. 5  Field update and memory update scheme.
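From the figures above (a 166 MHz clock target, two read cycles and one write cycle per field-component update during a stall-free row scan), a rough memory-limited throughput bound can be sketched; this is only a back-of-the-envelope model, not a figure from the paper:

```python
# Back-of-the-envelope, memory-limited throughput bound: during a
# stall-free row scan each component update costs 2 read cycles plus
# 1 write cycle, so the engine sustains at most f_clk / 3 updates/s.
def updates_per_second(f_clk_hz=166e6, read_cycles=2, write_cycles=1):
    return f_clk_hz / (read_cycles + write_cycles)

def seconds_per_time_step(nx, ny, nz, components=6, f_clk_hz=166e6):
    """Ideal (stall-free) time per FDTD iteration for an nx*ny*nz mesh."""
    return nx * ny * nz * components / updates_per_second(f_clk_hz)
```

For instance, for a 21 × 21 × 21 mesh this bound works out to roughly a millisecond per time iteration; row-boundary stalls and host communication would add on top of that.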
In the test case at hand, we assumed 8-Mbyte SDRAM chips, which would allow problems featuring up to two million field components to be faced. This is actually quite a reasonable figure for a wide range of practical cases; nevertheless, much larger SDRAMs are being designed and produced, which could be exploited almost straightforwardly (only minor changes would be needed in the memory control section, to manage a larger address space). In any case, the SDRAM physical size does not necessarily represent an upper limit for the mesh size: since, due to the discretization algorithm, the equations are decoupled, the simulation domain can easily be partitioned into separate segments. At the expense of some overhead and of increased complexity in the control software (i.e., a routine would have to manage the swapping of the mesh segments), each segment could be independently considered by the FDTD engine.

4. Performance Estimate

To evaluate the performance of the proposed architecture, and to allow for a comparison with a software implementation on a general-purpose architecture, a fairly simple case has been considered: a cubic resonant cavity has been simulated, discretized by means of a 21 × 21 × 21 cell mesh. After a pulsed stimulus is injected at a given location (accounting for a non-zero initial field), the transient simulation is carried out: given such an elementary structure, it makes little sense, here, to look at the simulation results in terms of actual time