A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs

A Framework for Rapid System-level Exploration, Synthesis, and Programming of Multimedia MP-SoCs

Mark Thompson†, Hristo Nikolov‡, Todor Stefanov‡, Andy D. Pimentel†, Cagkan Erbas†, Simon Polstra†, Ed F. Deprettere‡

†Department of Computer Science, University of Amsterdam, The Netherlands, {thompson,andy,cagkan,spolstra}
‡Leiden Embedded Research Center, Leiden University, The Netherlands, {nikolov,stefanov,edd}

ABSTRACT

In this paper, we present the Daedalus framework, which allows for traversing the path from sequential application specification to a working MP-SoC prototype in FPGA technology, with the (parallelized) application mapped onto it, in only a matter of hours. During this traversal, which offers a high degree of automation, guidance is provided by Daedalus' integrated system-level design space exploration environment. We show that Daedalus offers remarkable potential for quickly experimenting with different MP-SoC architectures and exploring system-level design options during the very early stages of design. Using a case study with a Motion-JPEG encoder application, we illustrate Daedalus' design steps and demonstrate its efficiency.

Categories and Subject Descriptors: J.6 [Computer-aided Engineering]: Computer-aided design

General Terms: Performance, design

Keywords: Design space exploration, system-level design and synthesis, rapid prototyping

1. INTRODUCTION

The complexity of modern embedded systems, which are increasingly based on heterogeneous MultiProcessor-SoC (MP-SoC) architectures, has led to the emergence of system-level design. To cope with this design complexity, system-level design aims at raising the abstraction level of the design process. Key enablers to this end are, for example, the use of architectural platforms to facilitate re-use of IP components and the notion of high-level system modeling and simulation [7]. The latter allows for capturing the behavior of platform components and their interactions at a high level of abstraction.
As such, these high-level models minimize the modeling effort and are optimized for execution speed, and can therefore be applied during the very early design stages to perform, for example, architectural Design Space Exploration (DSE). Such early DSE is of paramount importance as early design choices heavily influence the success or failure of the final product.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES+ISSS'07, September 30–October 3, 2007, Salzburg, Austria. Copyright 2007 ACM 978-1-59593-824-4/07/0009 ...$5.00.

System-level design for MP-SoC based embedded systems typically involves a number of challenging tasks. For example, applications need to be decomposed into parallel specifications so that they can be mapped onto an MP-SoC architecture [10]. Subsequently, applications need to be partitioned into HW and SW parts since MP-SoC architectures often are heterogeneous in nature. To this end, MP-SoC platform architectures need to be modeled and simulated to study system behavior and to evaluate a variety of different design options. Once a good candidate architecture has been found, it needs to be synthesized, which involves the synthesis of its architectural components as well as the mapping of applications onto the architecture. To accomplish all of these tasks, a range of different tools and tool-flows is often needed, potentially leaving designers with all kinds of interoperability problems. Moreover, there typically remains a large gap between the deployed system-level models and actual implementations of the system under study, known as the implementation gap [11].
Currently, there exist no mature methodologies, techniques, and tools to effectively and efficiently convert system-level specifications to RTL specifications.

In this paper, we present the Daedalus framework, which addresses these system-level design challenges. Daedalus' main objective is to bridge the aforementioned implementation gap for the design of multimedia MP-SoCs. It does so by providing an integrated and highly-automated environment for system-level architectural exploration, system-level synthesis, programming, and prototyping. Whereas our prior publications reported on several of Daedalus' components in isolation (e.g., [21, 15, 13]), this paper focuses on how the different components fit together as the pieces of a puzzle, resulting in a system-level design environment that addresses the entire design trajectory with an unparalleled degree of automation. We will illustrate the framework and its design flow using a case study with a Motion-JPEG encoder application.

The next section provides a bird's-eye overview of Daedalus, after which the three subsequent sections present the three core tools that constitute Daedalus in more detail. More specifically, Section 3 explains how multimedia applications are automatically decomposed into parallel specifications. Section 4 describes how, given the parallel application(s), promising candidate architectures can be found using our system-level modeling, simulation, and exploration methodology and toolset. In Section 5, we explain how selected candidate architectures can be automatically and rapidly synthesized, programmed, and prototyped. Section 6 presents a Motion-JPEG case study to illustrate Daedalus' design flow. In Section 7, we present related work, after which Section 8 concludes the paper.

2. THE DAEDALUS FRAMEWORK

In Figure 1, the design flow of the Daedalus framework is depicted.
As mentioned before, Daedalus provides a single environment for rapid system-level architectural exploration, high-level synthesis, programming, and prototyping of multimedia MP-SoC architectures. Here, a key assumption is that the MP-SoCs are constructed from a library of pre-determined and pre-verified IP components. These components include a variety of programmable and dedicated processors, memories, and interconnects, thereby allowing the implementation of a wide range of MP-SoC platforms. The remainder of this section provides a high-level overview of Daedalus, after which the subsequent sections zoom in on its core components and how they interact with the rest of the design flow.

Starting from a sequential application specification in C or C++, the KPNgen tool [21] allows for automatically converting the sequential application into a parallel Kahn Process Network (KPN) [8] specification. Here, the sequential input specifications are restricted to so-called static affine nested loop programs, which is an important class of programs in, e.g., the scientific and multimedia application domains. By means of automated source-level transformations [17], KPNgen is also capable of producing different input-output equivalent KPNs, in which for example the degree of parallelism can be varied. Such transformations enable application-level design space exploration.

The generated or handcrafted KPNs (the latter in the case that, e.g., the input specification did not entirely meet the requirements of the KPNgen tool) can subsequently be used by our Sesame modeling and simulation environment [15] to perform system-level architectural DSE. To this end, Sesame uses (high-level) architecture model components from the IP component library. Sesame allows for quickly evaluating the performance of different application-to-architecture mappings, HW/SW partitionings, and target platform architectures.
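As an illustration of what an input-output preserving, parallelism-increasing source-level transformation can look like, the sketch below hand-applies a simple loop unfolding in C. This is our own minimal example, not KPNgen output, and the function names are invented for the illustration; KPNgen's actual transformations [17] operate on static affine nested loop programs in a more general, automated way.

```c
#include <assert.h>

/* Original sequential form: one loop, one accumulation chain. */
int sum_original(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unfolded form: two independent partial sums over even and odd indices.
 * Input-output equivalent, but the two loops carry no dependence on each
 * other, so each could become a separate KPN process. */
int sum_unfolded(const int *a, int n) {
    int s0 = 0, s1 = 0;
    for (int i = 0; i < n; i += 2) s0 += a[i];   /* even-index task */
    for (int i = 1; i < n; i += 2) s1 += a[i];   /* odd-index task */
    return s0 + s1;                              /* merge step */
}
```

Both variants return the same result for every input, but the unfolded form exposes two independent computations; comparing such input-output equivalent variants is exactly what application-level design space exploration does.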
Such DSE should result in a number of promising candidate system designs, whose specifications (system-level platform description, application-architecture mapping description, and application description) act as input to the ESPAM tool [13]. This tool uses these system-level input specifications, together with RTL versions of the components from the IP library, to automatically generate synthesizable VHDL that implements the candidate MP-SoC platform architecture. In addition, it also generates the C/C++ code for those application processes that are mapped onto programmable cores. Using commercial synthesis tools and compilers, this implementation can be readily mapped onto an FPGA for prototyping. Such prototyping also allows for calibrating and validating Sesame's system-level models and, as a consequence, improving the trustworthiness of these models.

Ultimately, we aim at traversing Daedalus' design flow, going from a sequential application to a working MP-SoC prototype in FPGA technology with the application mapped onto it, in a matter of hours. Evidently, this would offer great potential for quickly experimenting with different MP-SoC architectures and exploring design options during the early stages of design. As our case study in Section 6 shows, we are well underway to achieving this goal.

3. PARALLELIZING APPLICATIONS

Today, traditional imperative languages like C or C++ are still dominant with respect to implementing applications for SoC-based architectures. It is, however, difficult to map these imperative implementations, with typically a sequential model of computation, onto MP-SoC architectures that allow for exploiting task-level parallelism in applications.
In contrast, models of computation that inherently express task-level parallelism in applications and make communications explicit, such as CSP [5] and Process Networks [8], allow for easier mapping onto MP-SoC architectures. However, specifying applications using these models of computation usually requires more implementation effort in comparison to sequential imperative solutions.

[Figure 1: The Daedalus design flow. A sequential application (C/C++) is parallelized by KPNgen into a Kahn Process Network in XML; Sesame performs system-level architectural exploration; ESPAM performs automated system-level synthesis, producing a platform description, C/C++ code for processors, IP cores in VHDL, and auxiliary files; a commercial RTL synthesis tool (e.g., Xilinx Platform Studio) then generates the gate-level MP-SoC/FPGA netlist. A library of IP components (high-level models, RTL models) feeds all steps, with a validation/calibration path from the prototype back to the models.]

In Daedalus, we start from a sequential imperative application specification (C/C++) which is then automatically converted into a Kahn Process Network (KPN) [8] using the KPNgen tool [21]. This conversion is fast and correct by construction. In the KPN model of computation, parallel processes communicate with each other via unbounded FIFO channels. Reading from channels is done in a blocking manner, while writing to channels is non-blocking. We use KPNs for application specifications because this model of computation nicely fits the targeted media-processing application domain and is deterministic. The latter implies that the same application input always results in the same application output, irrespective of the scheduling of the KPN processes.
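To make the channel semantics concrete, here is a small sketch in C, our own illustration rather than Daedalus code, of a three-process network: A produces tokens, B transforms them, and C consumes them. Reads block on an empty channel; since true KPN channels are unbounded, the fixed buffer capacity below is only a practical stand-in.

```c
#include <pthread.h>
#include <stddef.h>

#define CAP 16       /* stand-in capacity; KPN channels are unbounded */
#define NTOKENS 100

typedef struct {
    int buf[CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} channel;

static void ch_init(channel *c) {
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    pthread_cond_init(&c->not_full, NULL);
}

/* Blocking read: the process suspends until a token is available. */
static int ch_read(channel *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    int v = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return v;
}

/* Write: non-blocking in the KPN model (unbounded channels); here it only
 * waits when the finite stand-in buffer happens to be full. */
static void ch_write(channel *c, int v) {
    pthread_mutex_lock(&c->lock);
    while (c->count == CAP)
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = v;
    c->tail = (c->tail + 1) % CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

static channel ch1, ch2;   /* A --ch1--> B --ch2--> C */

static void *proc_A(void *arg) {        /* produce tokens 0..NTOKENS-1 */
    (void)arg;
    for (int i = 0; i < NTOKENS; i++) ch_write(&ch1, i);
    return NULL;
}

static void *proc_B(void *arg) {        /* transform each token */
    (void)arg;
    for (int i = 0; i < NTOKENS; i++) ch_write(&ch2, 2 * ch_read(&ch1));
    return NULL;
}

/* Process C runs in the caller: consume all tokens and fold them. */
long run_network(void) {
    ch_init(&ch1);
    ch_init(&ch2);
    pthread_t a, b;
    pthread_create(&a, NULL, proc_A, NULL);
    pthread_create(&b, NULL, proc_B, NULL);
    long sum = 0;
    for (int i = 0; i < NTOKENS; i++) sum += ch_read(&ch2);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return sum;   /* same result under every thread interleaving */
}
```

Because reads are blocking and each channel has a single writer and a single reader, run_network() returns the same value no matter how the operating system interleaves the threads, which is the determinism property discussed here.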
This provides complete scheduling freedom when, as will be discussed later on, mapping KPN processes onto MP-SoC architecture models for quantitative performance analysis and design space exploration.

As mentioned before, KPNgen's input applications need to be specified as so-called static affine nested loop programs to allow for automatic parallelization. As a first step, KPNgen can apply a variety of source-level transformations to these specifications in order to, for example, increase or decrease the amount of parallelism in the final KPN [17]. Subsequently, the C/C++ code is transformed into single assignment code (SAC), which resembles the dependence graph (DG) of the original nested loop program. Hereafter, the SAC is converted to a Polyhedral Reduced Dependency Graph (PRDG) data structure, being a compact mathematical representation of a DG in terms of polyhedra. Finally, a PRDG is converted into a KPN by associating a KPN process with each node in the PRDG. The parallel KPN processes communicate with each other according to the data dependencies given in the DG.

In Figure 2, a Kahn Process Network example is given in which three processes (A, B and C) are connected using three channels (CH1-3). Figure 2(a) shows the XML description of Kahn process B as generated by KPNgen. The XML describes both the topology of the KPN (i.e., how the processes are connected together, see e.g. lines 20-25) as well as the communications and computations performed by the processes. In our example, process B executes a function called compute (line 8). The function has one input argument (line 9) and one output argument (line 10).
Figure 2: A Kahn Process Network example. (a) XML specification of a KPN; (b) program code generated by ESPAM. [The listings below are reconstructed from the garbled figure; element order, indentation, and line breaks are approximate, and the figure's own line numbers are referenced in the text.]

(a) XML specification of Kahn process B and channel CH2:

    <process name = "B">
      <port name = "p2" direction = "in" type = "myType" />
      <port name = "p1" direction = "out" type = "myType" />
      <var name = "in_0" type = "myType" />
      <var name = "out_0" type = "myType" />
      <process_code name = "compute">
        <arg name = "in_0" type = "input" />
        <arg name = "out_0" type = "output" />
        <loop index = "k" parameter = "N">
          <loop_bounds matrix = "[1, 1, 0, -2; 1, -1, 2, -1]" />
        </loop>
        <par_bounds matrix = "[1, 0, -1, 384; 1, 0, 1, -3]" />
      </process_code>
    </process>
    <channel name = "CH2">
      <fromProcess name = "A" />
      <fromPort name = "p1" />
      <toProcess name = "B" />
      <toPort name = "p2" />
    </channel>

(b) Program code for process B, generated by ESPAM:

    void main() {
      for ( int k=2; k<=2*N-1; k++ ) {
        read( p2, in_0, sizeof(myType) );
        compute( in_0, out_0 );
        write( p1, out_0, sizeof(myType) );
      }
    }

    void read( byte *port, void *data, int length ) {
      int *isEmpty = port + 1;
      for ( int i=0; i<length; i++ ) {
        while ( *isEmpty ) { }        // reading is blocked if a FIFO is empty
        ((byte*) data)[i] = *port;    // read data from a FIFO
      }
    }

    void write( byte *port, void *data, int length ) {
      int *isFull = port + 1;
      for ( int i=0; i<length; i++ ) {
        while ( *isFull ) { }         // writing is blocked if a FIFO is full
        *port = ((byte*) data)[i];    // write data to a FIFO
      }
    }

The relation between the function arguments and the communication ports of the process is given in lines 3 and 6. The function has to be executed 2*N-2 times, as specified by the polytope in lines 12-13. The value of N is between 3 and 384 (lines 14-15).

From the XML specification, Daedalus allows for automatically generating the C/C++ code implementing the behavior of each KPN process. This is done by the ESPAM tool, which will be discussed later on. Figure 2(b) shows, for example, the generated C code for process B (some variable declarations have been omitted). The code contains the main behavior of a process, together with the read/write communication primitives. In accordance with the XML specification in Figure 2(a), the function compute, which is derived from the original sequential application specification, is part of a loop that iterates 2*N-2 times. For synthesis purposes, Daedalus also allows for generating the code for the read and write communication primitives, as shown in Figure 2(b). Currently, these primitives are implemented using polling and memory-mapped I/O. Note that the implementation of the write primitive is blocking since at implementation level FIFO channels are bounded in size.

4. DESIGN SPACE EXPLORATION

Given a (set of) KPN application specification(s), as for example generated by KPNgen or devised by hand, and the components in Daedalus' IP library, the Sesame system-level simulation framework [15] addresses the problem of finding a suitable and efficient target MP-SoC platform architecture. Figure 3 illustrates Sesame's layered infrastructure for the case in which a Motion-JPEG application is studied with a crossbar-based distributed-memory MP-SoC as target architecture. Sesame deploys separate application and architecture models, where an application model describes the functional behavior of an application and an architecture model defines architecture resources and captures their performance constraints. After explicitly mapping an application model onto an architecture model, they are co-simulated via trace-driven simulation. This allows for evaluation of the system performance of a particular application, mapping, and underlying architecture. Essential in this methodology is that an application model is independent from architectural specifics and assumptions on hardware/software partitioning.
As a result, a single application model can be used to exercise different hardware/software partitionings and can be mapped onto a range of architecture models, possibly representing different architecture designs or modeling the same architecture design at various levels of abstraction.

[Figure 3: Sesame's layered infrastructure. A Kahn application model (Video-in, DCT, Quant, VLE, Video-out, and Init processes) emits event traces; a mapping layer, configured with binding, scheduling policies, and run-time parameters, dispatches the traces onto an architecture model of four processors (P0-P3) with local memories and buffers connected by a crossbar switch. Structural descriptions and performance parameters are supplied as XML descriptions.]

For application modeling, the computational and communication behavior of the KPN application specifications is captured using application event traces. The computation and communication events in these traces typically are coarse-grained, such as Execute(DCT) or Read(channel_id, pixel-block). To generate the application events, the C/C++ code of each Kahn process is instrumented with annotations that describe the application's computational actions. In addition, Sesame provides read and write communication primitives that generate communication events as a side-effect. So, by executing the KPN model, each process generates its own trace of application events, representing the workload that is imposed on the underlying MP-SoC architecture model.

An architecture model simulates the performance consequences of the computation and communication events generated by an application model. To this end, each component in the architecture model is parameterized with performance parameters specifying the latencies of computation events like Execute(DCT), communication transactions, and memory accesses.
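The essence of this style of trace-driven simulation can be sketched in a few lines of C. This is our own illustration rather than Sesame code, and the event types and latency numbers are invented: a component model simply charges a configured latency for each application event it receives.

```c
#include <assert.h>

/* Coarse-grained application event types, in the spirit of Execute(DCT)
 * or Read(channel, pixel-block). Invented for this sketch. */
typedef enum { EXEC_DCT, EXEC_VLE, READ_CH, WRITE_CH } event_type;

/* One architecture model component: a latency, in cycles, per event type. */
typedef struct { long latency[4]; } component;

/* Replay an application event trace on a component and return the number
 * of simulated cycles it consumes. */
long replay(const component *c, const event_type *trace, int n) {
    long cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += c->latency[trace[i]];
    return cycles;
}
```

Replaying the same trace against components with different latency tables is what makes it cheap to contrast, say, a hardware DCT block against a software implementation on a processor core.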
This approach allows one to quickly assess, e.g., different HW/SW partitionings by simply experimenting with the latency parameters of processing components in the architecture model: a low computational latency refers to a HW implementation, while a high latency mimics a SW solution.

To bind application tasks to resources in the architecture model, Sesame provides an intermediate mapping layer. It controls the mapping of Kahn processes (i.e., their event traces) onto architecture model components by dispatching application events to the correct architecture model component. The mapping also includes the mapping of Kahn channels onto communication resources in the architecture model. The mapping layer has two additional purposes. First, the event dispatch mechanism in the mapping layer provides a variety of static and dynamic policies to schedule application tasks (i.e., their event traces) that are mapped onto shared architecture model components. Second, the mapping layer is also capable of dynamically transforming application events into (lower-level) architecture events in order to facilitate flexible refinement of architecture models [15].

The output of system simulations in Sesame provides the designer with performance estimates of the system(s) under study together with statistical information such as utilization of architectural components (idle/busy times), the contention in a system (e.g., network contention), profiling information (time spent in different executions), critical path analysis, and average bandwidth between architecture components. Such results allow for early evaluation of different design choices, identifying trends in the systems' behavior, and can help in revealing performance bottlenecks early in the design cycle. Here, the exploration process is also facilitated by the fact that system configurations (bindings, scheduling and arbitration policies, performance parameters, and so on) are specified using XML descriptions.
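The performance effect of binding event traces to shared versus private components, one of the things the mapping layer's dispatch and scheduling policies determine, can be caricatured as follows. This sketch is ours, not Sesame code, and it deliberately ignores communication and synchronization costs.

```c
#include <assert.h>

/* Each trace is reduced here to an array of per-event latencies. */

/* Two traces bound to one shared processor: the scheduler serializes
 * their events, so the component is busy for the combined workload. */
long on_shared(const long *t0, int n0, const long *t1, int n1) {
    long c = 0;
    for (int i = 0; i < n0; i++) c += t0[i];
    for (int j = 0; j < n1; j++) c += t1[j];
    return c;
}

/* Each trace bound to its own processor: the workloads overlap, and the
 * makespan is bounded below by the longer of the two (sync costs ignored). */
long on_separate(const long *t0, int n0, const long *t1, int n1) {
    long c0 = 0, c1 = 0;
    for (int i = 0; i < n0; i++) c0 += t0[i];
    for (int j = 0; j < n1; j++) c1 += t1[j];
    return c0 > c1 ? c0 : c1;
}
```

A shared component serializes the traces mapped onto it while separate components overlap them; in Sesame, such binding and scheduling choices are part of the XML-specified configuration rather than the model code.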
Hence, different system configurations can be rapidly simulated without remodeling and/or recompilation. As a result of the design space exploration with Sesame, a small set of promising MP-SoC platform instances can be selected for automatic synthesis (see the next section). Each selected platform instance is specified using two XML files: one describing the architectural platform at the system level, i.e., which IP components are used in the platform and how they are interconnected, and the other describing how application tasks are mapped onto the platform components.

5. SYSTEM-LEVEL SYNTHESIS

The system-level specifications that result from DSE, describing (the structure of) the application and platform architecture as well as the mapping of the former onto the latter, are given as input to the ESPAM tool for system-level synthesis [13]. To guarantee correctness-by-construction, ESPAM first runs a consistency check on the provided platform instance. This includes finding impossible and/or meaningless connections between system-level platform components, as well as parameter values that are out of range. Subsequently, ESPAM refines the abstract platform model to a parameterized RTL model which is ready for an implementation on a target physical platform. The refined system components are instantiated by setting their parameters based on the target physical platform features. Finally, ESPAM generates program (C/C++) code for each programmable processor in the multiprocessor platform in accordance with the application and mapping specifications. To this end, it uses the XML specifications generated by KPNgen. In addition, ESPAM also provides support for scheduling the code in the case that multiple application processes are mapped onto a single processor in the platform.
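The flavor of such a consistency check can be sketched as follows. This is our own simplified illustration, not ESPAM code; the structure fields and limits are invented for the example.

```c
#include <assert.h>

#define MAX_CPUS 16
#define MAX_TASKS 16

/* Toy platform instance: processor count, one example platform
 * parameter, and a task-to-processor binding. */
typedef struct {
    int n_processors;
    int fifo_depth;              /* example out-of-range candidate */
    int n_tasks;
    int binding[MAX_TASKS];      /* binding[t] = processor running task t */
} platform_instance;

/* Returns 1 if the instance is consistent, 0 otherwise. */
int check_platform(const platform_instance *p) {
    if (p->n_processors < 1 || p->n_processors > MAX_CPUS)
        return 0;                /* parameter value out of range */
    if (p->fifo_depth < 1)
        return 0;                /* meaningless buffer size */
    for (int t = 0; t < p->n_tasks; t++)
        if (p->binding[t] < 0 || p->binding[t] >= p->n_processors)
            return 0;            /* impossible connection: missing CPU */
    return 1;
}
```

Rejecting malformed instances up front is what lets the subsequent RTL refinement and per-processor code generation and scheduling assume a well-formed platform.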
Currently, this code scheduling is performed statically.

The output of ESPAM, namely an RTL specification of the MP-SoC platform, is a model that can adequately abstract and exploit the key features of a target physical platform at the register transfer level. It consists of four parts (as shown in Figure 1): 1) a platform topology description defining in greater detail the structure of the multiprocessor platform; 2) hardware descriptions of IP cores containing the predefined and custom IP cores used in 1). These IP cores, which are selected from Daedalus' IP component library, include programmable as well as dedicated processors, various memory components (FIFO buffers, random access memory, etc.), and different interconnects (point-to-point links, a shared bus with various arbitration mechanisms, and a crossbar switch). For programmable processors, ESPAM currently uses PowerPCs and MicroBlazes since it targets Xilinx Virtex-II-Pro FPGA technology for prototyping the synthesized MP-SoCs. ESPAM also automatically generates the custom IP cores needed as glue/interface logic between components in the platform; 3) the program code for processors, as mentioned before, to execute the software parts of the application on the synthesized multiprocessor platform; and 4) auxiliary information containing files which give tight control over the overall specifications, such as defining precise timing requirements and prioritizing signal constraints.

With the above descriptions, a commercial synthesizer can convert an RTL specification to a gate-level specification, thereby generating the target platform gate-level netlist (see the bottom part of Figure 1). At this moment, ESPAM facilitates automated MP-SoC synthesis and programming using Xilinx Virtex-II-Pro FPGAs and therefore uses the Xilinx Platform Studio (XPS) tool as a back-end to generate the final bit-stream file that configures the FPGA.
However, our framework is general and flexible enough to be targeted to other physical platform technologies as well.

6. A CASE STUDY

This section presents a case study in which we applied Daedalus to explore different implementation options for a Motion-JPEG (M-JPEG) encoder application mapped onto a heterogeneous MP-SoC architecture. The case study illustrates Daedalus' design steps and demonstrates its potential to quickly experiment with different MP-SoC architecture designs during the very early stages of design.

The KPN specification of the M-JPEG application was derived from sequential C code using the KPNgen tool as described in Section 3. A small manual modification (taking no longer than 30 minutes) to the original M-JPEG code was necessary to meet the KPNgen input requirements. The resulting Kahn application specification consists of 6 processes, as shown in the top part of Figure 3. Generating the KPN specification is a one-time effort since the same specification is used for all subsequent implementation and exploration steps.

To study target MP-SoC architecture instances for the M-JPEG application, we selected a crossbar-based MP-SoC platform with up to 4 processors (MicroBlaze or PowerPC) and distributed memory. In the bottom part of Figure 3, a 4-processor instance of this platform is depicted. We modeled this platform architecture with the Sesame framework. The processor, memory, and interconnect components in our architecture model were taken directly from Daedalus' high-level model component library. Only the performance parameters specific to the selected platform architecture needed to be specified, such as the latencies for computational actions, the latencies for setting up and communicating over the crossbar, and so on. We determined the values of these performance parameters by a combination of measurements on an ISS simulator (for the computational latencies on the MicroBlaze and PowerPC processors) and on the actual hardware itself.
Note that this needs to be done only once for each application, since the values can be reused throughout the exploration process. More information about the calibration of our architectural performance models can be found in [16]. Moreover, the mapping layer in our system-level model is configured such that it models the static scheduling scheme as facilitated by the ESPAM framework (see Section 5). To this end, for shared architecture components, the mapping layer dynamically groups trace events that originate from the same Kahn process and interleaves these event groups in the same manner as would result from ESPAM's static scheduling.

In our design space exploration experiments, we selected three degrees of freedom, namely the number of processors in the platform (1 to 4), the type of processors (MicroBlaze or PowerPC), and the mapping of application processes onto the processors. For the sake of simplicity, the network configuration (crossbar switch) as well as the buffer/memory sizes remained unaltered (although these could also have been included in the exploration). For this particular case study, we were able to exhaustively explore the resulting design space, consisting of 10,148 design points, using system-level simulation, where the M-JPEG application was executed on 8 consecutive 128x128 resolution frames for each design point. As can be seen in Table 2, this design space sweep took 2.5 hours, demonstrating Sesame's efficiency. Figure 4 shows, for three platform instances, the relation between mappings and system performance, where we sorted the different mapping instances on performance. It clearly illustrates the importance of finding a good mapping, since non-optimal mappings on larger MP-SoC platforms may perform worse than a good mapping on smaller MP-SoCs.

To validate our DSE experiments, we selected a number of design points with random application-to-architecture mappings and synthesized and prototyped them using ESPAM.
The results of these validation experiments are shown in Figure 5. Note that a