Design Decisions in the Pipelined Architecture for Quantum Monte Carlo Simulations

Akila Gothandaraman, Gregory D. Peterson
Department of Electrical Engineering and Computer Science
The University of Tennessee, Knoxville
{akila, gdp}@utk.edu

Robert J. Hinde, Robert J. Harrison
Department of Chemistry
The University of Tennessee, Knoxville
{rhinde, rharrison}@utk.edu

Abstract—The ground-state properties of atomic and molecular clusters can be obtained using Quantum Monte Carlo (QMC) simulations. We propose a reconfigurable hardware architecture using Field-Programmable Gate Arrays (FPGAs) to implement the kernels of the QMC application. To achieve higher clock rates, we experiment with different pipeline stages for each component of the design and develop a deeply pipelined architecture that provides the best performance in terms of clock rate, while at the same time making modest use of embedded memory and multiplier resources so that we can fit additional functions in a future implementation. Here, we discuss the details of the pipelined architecture and our design decisions in developing a general framework that can be used to obtain the potential energy of atomic or molecular clusters and extended to compute other useful properties.

I. INTRODUCTION

Quantum Monte Carlo (QMC) is a powerful simulation technique for solving the many-body Schrödinger equation. Using this method, we can investigate atomic and molecular clusters and accurately treat quantum mechanical systems to obtain the energetics and equilibrium structures of these systems. We can obtain the ground-state properties, such as the potential energy, whose cost is O(N²) in the number of particles. A software QMC application can be used to simulate these systems on general-purpose processors. However, to speed up the application and to provide better computational scaling with the system size, we seek other methods to accelerate the application kernels.
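As a point of reference, the O(N²) software kernel being accelerated can be sketched as follows. The pair potential used here is a hypothetical Lennard-Jones-style stand-in (the actual functional form depends on the identity of the atoms and is not given in this paper); note that it is already written as a function of the squared distance, a choice whose hardware motivation is discussed in Section II.

```python
# Software sketch of the O(N^2) potential-energy kernel. The pair
# potential below is a hypothetical stand-in: the true function
# depends on the atoms being simulated.

def pair_potential(r2):
    # Hypothetical Lennard-Jones-style pair potential, written as a
    # function of the squared distance r2 (no square root needed).
    inv_r6 = 1.0 / (r2 * r2 * r2)
    return 4.0 * (inv_r6 * inv_r6 - inv_r6)

def total_potential_energy(positions):
    # Sum the pair potential over all N*(N-1)/2 unique pairs: O(N^2).
    energy = 0.0
    n = len(positions)
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(i + 1, n):
            xj, yj, zj = positions[j]
            r2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            energy += pair_potential(r2)
    return energy
```

The doubly nested loop over particle pairs is the quadratic-cost hot spot that the FPGA pipeline targets.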
We take advantage of the inherent parallelism of Monte Carlo simulations [1] and use hardware-based techniques to exploit additional parallelism through hardware pipelining. Due to the complex nature of the functions involved, interpolation-based approximation is a natural choice for calculating these functions. Also, the exact shape of the function depends on the nature of the atoms involved, so it is desirable to use programmable hardware rather than a hardwired logic implementation, allowing us to vary the parameters depending on the system. We present our ongoing research in [2], [3], where we achieve promising results when the kernels of the QMC application are implemented on reconfigurable platforms consisting of Field-Programmable Gate Arrays (FPGAs). Here, we focus on the architectural details and on our design goals and choices in developing the pipelined architecture, which exploits the polygranular parallelism of FPGAs to calculate the potential energy of atomic clusters.

The substantial FPGA hardware implementation effort is only justified if the design meets our goals. We consider the following two goals: performance, measured by the speedup achieved by our hardware implementation compared to the best optimized serial software application, and accuracy, where we compare the results of our implementation against the software version employing double-precision floating-point representation.
We satisfy the above design goals by developing a deeply pipelined reconfigurable architecture that employs a fixed-point representation, chosen after careful error analysis. Another design goal is flexibility, i.e., to create a general, user-friendly framework that can be used to simulate a variety of atomic clusters, which share the same overall function with only slightly different simulation parameters. Also, extending the present framework to calculate other related properties will be extremely useful for scientists studying large atomic or molecular clusters. Our present implementation meets the above design goals by providing speedups of 1.5x-11x for the potential energy kernel on two reconfigurable test platforms and an accuracy on the order of, or better than, the double-precision software implementation. In [4], a QMC application ported to the nVidia 7800 GTX GPU provides a speedup of 6x over the optimized software application running on a 3.0 GHz Intel P4 processor. To the best of our knowledge, our work is the first attempt at accelerating a QMC application using reconfigurable computing.

The paper is organized as follows. In Section II, we describe the different components of the pipelined hardware implementation. In Section III, we compare the results of our implementation on two reconfigurable test platforms. We provide conclusions in Section IV.

978-1-4244-2167-1/08/$25.00 ©2008 IEEE

Figure 1: Overall system block diagram

II. DESCRIPTION OF THE PIPELINED HARDWARE

Figure 1 shows the block diagram of the pipelined architecture.
The architecture consists of the following components: CalcDist calculates the squared distances between the pairs of atoms; CalcPE evaluates the potential energy using an interpolation scheme; AccPE accumulates the potential energies into partial results that are delivered to the host processor. The CalcPE module takes as inputs the squared distances from CalcDist and the interpolation coefficients, and computes the potential energy function using quadratic interpolation. The above components are implemented on the FPGA. The software Quantum Monte Carlo (QMC) application on the host processor delivers the appropriate inputs to the FPGA and collects the results from it. In addition, we can re-use the calculated results to obtain related properties with minimal or no changes to the actual design. For instance, we have employed a similar interpolation strategy to obtain another property, the wave function, which re-uses the squared distance calculation component. Here, we restrict our discussion to the potential energy calculation. In the following sub-sections, we evaluate, for each component, the available design choices, their pros and cons, and our chosen implementation.

A. Calculation of squared distance (CalcDist)

The squared distance calculator, CalcDist, is a generic module that computes the pair-wise squared distances between the atoms. Figure 2 shows the block diagram of this component with latencies in clock cycles for the sub-blocks. The host processor stores the co-ordinate positions of the atoms in the embedded Block RAMs (BRAMs).

First, we consider different ways of storing the co-ordinate positions in the BRAMs. Two BRAMs could be used to store the pairs of co-ordinates, with one BRAM storing the two's complement of the positions. Two such BRAMs would require 3*256 KB of memory to store 4096 32-bit (x, y, z) positions. We reduce the memory requirement by using a single dual-port Block RAM to store each co-ordinate position.
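A functional software model of CalcDist on fixed-point coordinates might look as follows. The 32-bit inputs and wider squared-distance output mirror the widths given in this section; the 16-bit fractional split is an illustrative assumption, since the actual split was chosen by error analysis and is not stated here.

```python
# Software model of CalcDist: pair-wise squared distances computed on
# fixed-point coordinates. The 16-bit fractional split below is an
# assumption for illustration; the paper chooses the widths after a
# fixed-point error analysis it does not detail.

FRAC_BITS = 16          # assumed fractional bits of the fixed-point format
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    # Quantize a real coordinate to a signed fixed-point integer.
    return int(round(x * SCALE))

def calc_dist_fixed(p, q):
    # (xi-xj)^2 + (yi-yj)^2 + (zi-zj)^2 on fixed-point inputs.
    # Each squared difference carries 2*FRAC_BITS fractional bits,
    # matching the wider output of the hardware pipeline.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def to_real_r2(r2_fixed):
    # Convert the fixed-point squared distance back to a real number.
    return r2_fixed / (SCALE * SCALE)
```

Comparing `to_real_r2(calc_dist_fixed(...))` against a double-precision computation of the same pair is one way to reproduce the kind of error analysis the paper describes.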
A state machine controller provides the read addresses to the Block RAM every clock cycle. The Block RAM data outputs are connected to the inputs of the CalcDist module. We have several design choices for implementing each sub-block in this module. The module relies on addition, subtraction and multiplication operations for each co-ordinate position. We use the IP cores from the Xilinx CoreGen library [5] to implement the adders, subtractors and multipliers. The cores are available with a variety of pipelining options and variable input-output data widths, and are customized for the target architecture. We use the embedded multipliers that are available as hard macros on the Xilinx Virtex series. We experiment with the available pipelining options, and our current implementation uses the multipliers with maximum pipelining (seven clock cycles of latency; minimum pipelining has a latency of two clock cycles) to provide higher clock rates. The initial latency of this module is ten clock cycles, after which it produces a 52-bit squared distance every clock cycle once the pipeline is full. The data widths for inputs and outputs are chosen after analyzing the errors with various fixed-point representations.

Figure 2: Squared distance calculation, CalcDist

Figure 3: Potential energy calculation, CalcPE

B. Calculation of potential energy (CalcPE)

Figure 3 shows the potential energy calculation module, CalcPE. The potential energy function depends on the type of the atoms involved in the simulation. We keep in mind the above-mentioned design goals, namely performance, accuracy and flexibility, during the design process. The potential energy function is re-written as a function of the squared distance, r².
Rewriting the potential energy function as a function of the squared distance is performance-critical because the square root operation on r² needed to obtain the co-ordinate distance is expensive in hardware. Due to the numerical behavior of the potential energy function, we divide it into two regions, region I and region II, and employ a unique transformation scheme in each region to increase the accuracy of our approximations [3].

To evaluate the potential energy, we require the interpolation coefficients {a, b, c}, the squared distances and a delta value related to the squared distances. In addition, we specify a cut-off value, sigma, that divides the two regions. Of the inputs required, the interpolation constants and the cut-off value depend on the exact shape of the function, which in turn varies slightly depending on the identity of the atoms being simulated. Hence, the coefficients are stored in the embedded Block RAMs and the cut-off value is loaded into a register. These initializations are performed by the host processor. A change in the simulation system only requires that we load a different set of parameters that best approximates the corresponding potential energy function.

We have a number of design choices here. First, we have a choice of the order of interpolation. We try quadratic and cubic interpolation for the potential energy function. As one would expect, cubic interpolation consumes more Block RAMs to store the interpolation constants {a, b, c, d} than quadratic interpolation, and more multipliers are needed to evaluate the function itself. Quadratic interpolation consumes fewer resources and delivers the required numerical accuracy. Since our design goal is to extend the current framework to fit additional functions, we choose quadratic interpolation, which allows us to use the remaining resources to calculate other properties.
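The region-split quadratic lookup can be sketched in software as follows. The cut-off sigma, the {a, b, c} coefficient tables, and a per-region bin width delta follow this section's description; the uniform-bin indexing and the local-offset evaluation are illustrative assumptions, since the exact lookup scheme and the per-region transformations are not spelled out here.

```python
# Sketch of CalcPE's region-split quadratic interpolation. The
# uniform-bin indexing and the table-construction helper are
# hypothetical; the paper only specifies coefficients {a, b, c},
# a delta value, and a cut-off sigma separating the two regions.

def make_table(f, x0, delta, n):
    # Fit one quadratic per bin of width delta by sampling f at the
    # bin's left edge, midpoint, and right edge (hypothetical helper).
    table = []
    for i in range(n):
        lo, mid, hi = x0 + i * delta, x0 + (i + 0.5) * delta, x0 + (i + 1) * delta
        f0, f1, f2 = f(lo), f(mid), f(hi)
        # Quadratic a + b*dx + c*dx^2 through the three samples, in the
        # local coordinate dx = x - bin_start.
        c = 2.0 * (f0 - 2.0 * f1 + f2) / (delta * delta)
        b = (4.0 * f1 - 3.0 * f0 - f2) / delta
        table.append((f0, b, c))
    return table

def calc_pe(r2, sigma, x0, delta, table_I, table_II):
    # sigma decides whether r2 falls in region I or region II, and the
    # corresponding coefficient table is used.
    table, start = (table_I, x0) if r2 < sigma else (table_II, sigma)
    i = int((r2 - start) // delta)
    dx = r2 - (start + i * delta)
    a, b, c = table[i]
    return a + b * dx + c * dx * dx
```

Because each bin's quadratic is exact for quadratic functions, the interpolation error is governed by the bin width delta and the curvature of the potential, which is the accuracy/Block-RAM trade-off discussed next.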
After choosing the interpolation order, we decide the data width and the number of interpolation coefficients that produce accurate potential energy results (compared to the software QMC application) after function evaluation. This is a trade-off between accuracy and Block RAM resource usage. After analysis, we choose 256 52-bit coefficients for region I and 1344 52-bit coefficients for region II. This consumes 16 KB of memory for the region I coefficients and 128 KB for the region II coefficients.

For the potential energy function evaluation, we use the Xilinx IP cores to implement the adders and multipliers. We use maximum pipelining for the multipliers to improve the clock rates. Figure 3 shows the latencies in clock cycles for each step. The cut-off value, sigma, decides whether the squared distance falls into region I or II. Accordingly, the interpolation constants {a, b, c} are obtained from the look-up table to evaluate the function. Once the pipeline is full, this module outputs a 52-bit potential energy value every clock cycle.

C. Accumulation of potential energy (AccPE)

Figure 4 shows the AccPE module, which accumulates the intermediate values of potential energy produced by CalcPE. The potential energy function is transformed prior to interpolation; hence we use two types of accumulators: for region I potential results we multiply the intermediate results, and for region II potential results we add the intermediate results. The potential values that result from region I squared distances and interpolation constants are called region I potential values. Similarly, region II potential values result from region II squared distances and interpolation constants. The region II accumulator is instantiated from the Xilinx CoreGen library and accumulates the result (the region II potential if applicable, or a zero if a region I potential is produced that clock cycle).
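The two accumulation schemes of this subsection, a running sum for region II and a running product for region I, can be modeled in software as follows. The left-shift renormalization of the region I product and the shift count returned to the host follow the text; collapsing the seven interleaved hardware multipliers into a single running product is a simplification made here for clarity.

```python
# Software model of AccPE's two accumulators. Region II values are
# summed; region I values are multiplied, and the running product is
# renormalized by left shifts (doublings) whenever it drops below
# 2^-1, with the shift count recorded for the host. The seven
# interleaved hardware multipliers are collapsed into one product.

def accumulate(values, regions):
    # regions[i] tags values[i] as region 'I' or region 'II'.
    sum_II = 0.0
    prod_I, shifts = 1.0, 0
    for v, reg in zip(values, regions):
        if reg == 'II':
            sum_II += v          # region I slots contribute a zero to the sum
        else:
            prod_I *= v          # region II slots contribute a one to the product
            while 0.0 < prod_I < 0.5:
                prod_I *= 2.0    # left shift to preserve precision
                shifts += 1
    return sum_II, prod_I, shifts

def host_reconstruct(prod_I, shifts):
    # The host undoes the recorded shifts to recover the true product.
    return prod_I * 2.0 ** (-shifts)
```

Keeping the running product in [0.5, 1) by doubling, and letting the host divide the recorded doublings back out, is what preserves precision across long chains of multiplications by small values.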
For accumulating region I potential values, we form products of the potential energy results every clock cycle (the region I potential if applicable, or a one if a region II potential is produced in that clock cycle). The Xilinx CoreGen multipliers are used with maximum pipelining, which gives a latency of seven clock cycles. Hence, we use seven instances of the multipliers and switch between them to increase the data rate and produce a partial product every clock cycle. At the end of our calculations for N atoms, we have a single partial sum from region II and seven partial products from region I.

It should be noted that we perform successive multiplications of potential values, which results in a loss of precision. To guarantee sufficient accuracy, we bit-shift the result to the left after computing the partial product (if it is less than 2^-1) and keep track of the number of shifts. There are also overflow issues associated with the accumulator while accumulating region II potential values. To overcome this problem, we accumulate the region II potentials in a large register that can hold N evaluations of the maximum value of the re-scaled potential. The accumulated results are processed by the host processor, which reconstructs the floating-point value of the total potential energy.

Figure 4: Potential energy accumulation, AccPE

Table 1: Comparison of resource usage for PE kernel

  Resource type               Platform 1   Platform 2
  Slices                      83%          27%
  Block RAMs                  73%          26%
  18x18 multipliers/DSP48s    62%          26%

Table 2: Performance (speedup) for PE kernel

            Platform 1   Platform 2
  Speedup   1.5x         11x

III. RESULTS

Table 1 compares the usage of resources – Slices, Block RAMs, and embedded 18x18 multipliers (on the Virtex-II Pro) or Digital Signal Processing blocks (DSP48s, on the Virtex-4) – for two test platforms built around two different Virtex-series FPGAs. Platform 1 is an Amirix AP130 PCI FPGA development board [6] with a Virtex-II Pro series FPGA (XC2VP30) connected to an Intel Xeon 2.4 GHz processor. Platform 2 is the Cray XD1 platform [7], in which we use a node consisting of a Virtex-4 series FPGA (V4LX160) connected to a dual-processor, dual-core AMD Opteron 2.2 GHz. For the two platforms, we use an entirely software QMC application written in C++ (-O3 optimization) for benchmarking purposes. We then compare the potential energy results and the execution times of this software QMC application to those of the hardware-accelerated application, in which the potential energy calculation is delegated to the FPGA.

The design choices stated in Section II are chosen for each component to meet our design goals of performance, accuracy and flexibility. Platform 1 consumes a vast amount of resources for platform-related functions; hence, after the potential energy function is mapped onto the FPGA, very limited resources remain to implement additional functions. Platform 2 consists of a higher-density FPGA and consumes only roughly 25 percent of its resources for the potential energy function. Table 2 shows the speedups achieved on the two test platforms. Platform 1 performs poorly due to the slow communication interface between the FPGA and the host processor. This is overcome in Platform 2, which offers a better communication interface and hence achieves the performance expected from our implementation.
Our design choices, including the use of a pipelined architecture, a fixed-point representation, and a quadratic interpolation scheme to evaluate the function, help to achieve a significant performance improvement over the software version running on the processor, while at the same time keeping resource consumption modest.

IV. CONCLUSIONS

We present a pipelined reconfigurable architecture to obtain the potential energy of clusters of atoms. We outline some of the goals that are critical for our design and the available design choices for the various building blocks of our architecture. The hardware implementation of a large application like ours offers a number of choices during the design process, which have to be carefully evaluated to balance performance, numerical accuracy and resource usage. Our hardware design strategy for the Quantum Monte Carlo simulation offers promising speedups while delivering the desired numerical accuracy.

ACKNOWLEDGMENTS

This work was supported by National Science Foundation grant NSF CHE-0625598, and the authors gratefully acknowledge prior support for related work from the University of Tennessee Science Alliance.

REFERENCES

[1] J. Doll, D. L. Freeman, "Monte Carlo Methods in Chemistry," IEEE Computational Science and Engineering, Vol. 1, Issue 3, pp. 22-32, Spring 1994.
[2] A. Gothandaraman, G. L. Warren, G. D. Peterson, R. J. Harrison, "Reconfigurable Accelerator for Quantum Monte Carlo Simulations in N-body Systems," (poster), Supercomputing Conference, November 2006.
[3] A. Gothandaraman, G. D. Peterson, G. L. Warren, R. J. Hinde, R. J. Harrison, "FPGA acceleration of a quantum Monte Carlo application," Parallel Computing, Vol. 34, Issue 4-5, pp. 278-291, May 2008.
[4] A. Anderson, W. A. Goddard III, P. Schröder, "Quantum Monte Carlo on graphical processing units," Computer Physics Communications, Vol. 177, Issue 3, pp. 298-306, August 2007.
[5] Xilinx Inc., http://www.xilinx.com/ipcenter/index.htm
[6] Amirix Systems Inc., http://www.amirix.com
[7] Cray Inc., http://www.cray.com/products/xd1/index.html