Compiling Scilab to high performance embedded multicore systems

Timo Stripf a,⇑, Oliver Oey a, Thomas Bruckschloegl a, Juergen Becker a, Gerard Rauwerda b, Kim Sunesen b, George Goulas c, Panayiotis Alefragis c, Nikolaos S. Voros c, Steven Derrien d, Olivier Sentieys d, Nikolaos Kavvadias e, Grigoris Dimitroulakos e, Kostas Masselos e, Dimitrios Kritharidis f, Nikolaos Mitas f, Thomas Perschke g

a Institute for Information Processing Technologies (ITIV), Department of Electrical Engineering, Karlsruhe Institute of Technology (KIT), Germany
b Recore Systems, The Netherlands
c Embedded System Design and Application Group, Department of Telecommunication Systems and Networks, Technological Educational Institute of Mesolonghi, Greece
d INRIA Research Institute, Université de Rennes I, France
e Computer Systems Laboratory, Department of Computer Science and Technology, University of Peloponnese, Greece
f Broadband & Wireless Systems Department, Intracom S.A. Telecom Solutions, Greece
g Fraunhofer-Institute of Optronics, System Technologies and Image Exploitation, Germany

Article history: Available online 26 July 2013

Keywords: Software toolchain, Multi-processor system-on-chip, Scilab, Compilation, Fine- and coarse-grain parallelization

Abstract

The mapping process of high performance embedded applications to today's multiprocessor system-on-chip devices suffers from a complex toolchain and programming process. The problem is the expression of parallelism with a pure imperative programming language, which is commonly C.
This traditional approach limits the mapping, partitioning and the generation of optimized parallel code, and consequently the achievable performance and power consumption of applications from different domains. The Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb (ALMA) European project aims to bridge these hurdles through the introduction and exploitation of a Scilab-based toolchain which enables the efficient mapping of applications on multiprocessor platforms from a high level of abstraction. The holistic solution of the ALMA toolchain allows the complexity of both the application and the architecture to be hidden, which leads to better acceptance, reduced development cost, and shorter time-to-market. Driven by the technology restrictions in chip design, the end of exponential growth of clock speeds and an unavoidable increasing demand for computing performance, ALMA is a fundamental step forward in the necessary introduction of novel computing paradigms and methodologies.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Efficient, flexible, and high performance chips are needed. Many performance-critical applications (e.g. digital video processing, telecoms, and security applications) that need to process huge amounts of data in a short time would benefit from these attributes. Research projects such as MORPHEUS [1] and CRISP [2] have demonstrated the feasibility of such an approach and presented the benefit of parallel processing on real hardware prototypes. Providing a set of programming tools for the respective cores is however not enough. A company must be able to take such a chip and program it, based on high-level tools and automatic parallelization/mapping strategies, without detailed knowledge of the underlying hardware architecture.
Only when the advantages of an Application-Specific Integrated Circuit (ASIC) in terms of processing density are combined with the flexibility of a Field-Programmable Gate Array (FPGA), while remaining affordable because such a chip can be manufactured in large numbers (like general purpose processors or FPGAs), will it also profit from the benefits of programmability and system-level programming.

The Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb (ALMA, Greek for "leap") European project [3] intends to deliver a full framework for the development of parallel and concurrent computer systems. The main concept is programming in the platform-independent high-level language Scilab, which is a pointer-free, numerically-oriented programming language similar to the MATLAB language [4], while still obtaining an optimized binary for a given hardware architecture automatically from the tools. Scilab, together with ALMA-specific extensions, enables a simplified parallelism extraction.

⇑ Corresponding author: T. Stripf.
Microprocessors and Microsystems 37 (2013) 1033–1049
A novel Architecture Description Language (ADL), the ALMA ADL, is integrated into the whole toolflow to gain platform independence from the target architecture. The ALMA parallel software optimization environment will be combined with a SystemC simulation framework for Multiprocessor System-on-Chip (MPSoC). The overall framework is evaluated by targeting two architectures as well as two application test cases.

In this paper, we present our concept of the ALMA toolset enabling compilation of Scilab source code to multicore architectures. The rest of this paper is organized as follows: First, Section 2 discusses the Scilab input language. Section 3 gives an overview of the ALMA toolset, followed by in-depth descriptions of the individual components. The toolset is based on an ADL that is explained in Section 4. Section 5 introduces the ALMA front-end tools for parsing, optimizing, and early performance evaluation of the Scilab input language. The coarse-grain parallelism extraction (Section 6) partitions, maps, and schedules tasks to the target processor cores, while the fine-grain parallelism extraction (Section 7) exploits data-level parallelism on instruction level. Parallel platform code generation (Section 8) compiles the optimized ALMA IR to machine code that can be simulated by the multicore architecture simulator (Section 9). In Section 10, the ALMA target architectures and application test cases are introduced, and Section 11 concludes the paper.

2. Scilab input language

With the end of exponential growth of clock frequencies caused by the power wall, Multi-processor System-on-Chip (MPSoC) architectures arise as one of the most popular ways to gain high performance on embedded systems.

From the architecture perspective, efficient usage of MPSoCs requires the exploitation of parallelism on different granularities.
On system level, coarse-grain parallelism must be exploited by parallelizing and mapping algorithms to different processing cores. Fine-grain parallelism is exploited on instruction level by targeting Single Instruction, Multiple Data (SIMD) instructions, which requires the usage of small integer data types and the vectorization of the source code. Additionally, the usage of efficiently supported data types (integer or fixed-point data types rather than floating-point data types) offers a performance improvement and energy reduction, but comes along with an accuracy reduction. In general, the efficient programming of MPSoCs requires significant experience and knowledge of target-specific optimizations. Thus, programmability is one of the major problems of these systems.

On the other side, the end user does not want to care about parallelism and data types. In general, a typical end user does not have – or does not want to have – a deep knowledge of the underlying hardware. The end user wants to develop and explore algorithms on a high level using a simple and comfortable language within a numerical computing environment such as MATLAB [4]. For mapping his algorithm to the target architecture, the end user wants a one-button solution that provides a high performance and energy efficient result. In our approach, we try to bridge the gap between the end user and architecture perspectives by providing an integrated toolchain for semi-automatic mapping of Scilab code to MPSoC architectures. Scilab is a platform-independent, numerically-oriented, high-level programming language. MATLAB code, which is similar in syntax, can be converted to Scilab. Scilab is one of several open source alternatives to MATLAB.

While using the Scilab language for targeting MPSoC architectures offers a lot of advantages to the end user, a compiler architect would not select Scilab as a first choice.
The language utilizes matrix-based computation, dynamic typing, and automatic memory management, and it lacks the ability to express concurrency, thus making it hard to produce efficient code for MPSoC architectures. On the other hand, the Scilab language is very beneficial for automatic parallelization since it does not use pointers. In the following, we explain this advantage as well as our approach for addressing the difficulties of the Scilab input language.

Dynamic typing

The Scilab language uses dynamically typed variables. Each variable can contain any Scilab data type (e.g. strings, booleans, integers, floating-point scalars and especially n-dimensional matrices of these types) and the variable's data type is specified by value assignment. Within the Scilab environment, type checking is performed at run time – as opposed to at compile time. The run-time type checking is computationally intensive and implies the usage of automatic memory management, thus hindering efficient code generation for MPSoC architectures. To solve this issue, we extended the Scilab language with annotations of static type information. This approach allows the end user to gradually migrate Scilab applications to support the ALMA compilation process.

Matrix-based computation

Scilab uses matrices as the main data type. A variable can contain arrays of 1 (vectors), 2 (matrices), or more dimensions. The language provides simple matrix operations on the data type such as multiplication. At run time, the size of matrices or vectors is not fixed and can be changed by matrix operations. Therefore, the user must provide the maximum size and dimension of array data types within our ALMA annotations in order to avoid unpredictable memory consumption as well as the run-time overhead of dynamic memory allocation. In that way, changing the size of matrices is still possible (and is commonly used for constructing matrices), but only the maximum size is limited.
Data type usage

Scilab supports integer data types of various bit widths, but they are not used in common practice. End users typically rely on floating-point data types. In general, floating-point operations are slower and less energy efficient than fixed-point operations. Additionally, the corresponding floating-point unit within a processor consumes a significant amount of die area, making the processor more expensive. The usage of integer or fixed-point operations can speed up computation and avoids expensive hardware, but comes along with a loss of accuracy or a limited variable range. Therefore, we provide Scilab annotations for the end user to specify the dynamic range of variables and the maximum quantization error caused by the reduced accuracy. With this additional information, the ALMA toolchain is able to automatically select appropriate integer data types for floating-point variables. The integer operations can then be further optimized by using SIMD instructions.

Pointer free

A pointer (also called a reference) is a programming language data type whose value refers directly to (or "points to") another value stored elsewhere in the computer memory using its address. Scilab is a pointer-free language in common practice, i.e. a typical end user does not use pointers for expressing algorithms. That is in contrast to the C programming language, which requires pointers e.g. for strings, for efficiently passing values to functions, or for returning more than one variable from a function. In contrast to many common programming languages, Scilab allows specifying more than one output parameter per function, thus enabling the pointer-free programming model. All function input parameters are call-by-value in Scilab since there exist no pointers for realizing call-by-reference. The absence of pointers within the Scilab language is very beneficial for compiler optimization since it avoids
the pointer aliasing problem. Pointers can be changed dynamically at run time, thus making it – in general – impossible to determine where a pointer points at compile time. Aliasing refers to the situation where the same memory location can be accessed using different names. Since a pointer can point to any variable or to the same location as another pointer, a compiler does not know the side effects of a pointer access. It is thus not allowed to reorder pointer accesses, and that finally limits the exploitation of instruction- and thread-level parallelism within the compiler.

2.1. ALMA-specific Scilab extension

The ALMA toolchain recognizes an extended version of the Scilab language in order to assist the automated mapping of Scilab specifications to a multicore system. The standard Scilab development environment works as an interpreter with dynamic type checking, meaning that the type context of expressions is validated during execution. Matrices, which are the dominant data type, may change their type and size at run time by simple assignment statements. ALMA likewise allows dynamic matrix resizing, but restricts the matrices' element type to be determined statically at compile time. For this reason, an ALMA input specification is composed of a declarative section accepting type and variable declarations using the CDecl language and a Scilab language section accepting Scilab programs. The two regions lie in a single file (.sce) and are separated by the //%% delimiter, with the declarative region being first in sequence.

The Scilab compiler engine of ALMA translates Scilab source code to annotated C code. The parser supports every specified feature of Scilab 5.3.3. However, idiosyncratic elements of Scilab such as embedded C code blocks are not supported.
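A minimal input file following this two-region layout might look as shown below. This is a hypothetical sketch based on the description above; the variable and function names are invented, and the exact annotation syntax may differ from the actual ALMA tools.

```scilab
// CDecl declarative region: element types and static upper bounds
int A[4][4];
int B[4][4];
int scale(in int **m, out int **r);
//%%
// Scilab region: the algorithm itself
function r = scale(m)
  r = 2 * m;    // element-wise scaling
endfunction

A = ones(4, 4);
B = scale(A);
```

The declarative region fixes the element types and maximum sizes at compile time, while the Scilab region below the //%% delimiter remains ordinary, interpreter-compatible Scilab code.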
More specifically, the supported language features are listed in Table 1.

The CDecl language is an extension of a specific subset of the C89 declarative syntax, adapted to Scilab compilation requirements. The subset of the C language declarations includes declarative statements for arrays, character strings, and functions. Scalars are considered as single-element arrays, while one-dimensional arrays are modeled as row or column vectors. Using the widely known C language declarative syntax has the advantage of requiring minimal effort for a Scilab designer to start developing programs for ALMA. Every variable or function should be declared before appearing in the Scilab section. The user should declare the type and an upper bound (static size) for the size of matrix variables. The size is dynamically allocated during program initialization and refers to a steady data pool where the matrix data reside in memory. The dynamic size of the array cannot exceed the static size declared in the declarative region. Moreover, matrix variables may reside in either global or function scope. For every global variable in Scilab, a declaration of the following type is made in CDecl, where for matrix A a size of 10*10 integers is allocated upon program initialization:

int A[10][10];

The following declaration depicts the declaration of matrix B inside the scope of function foo. In this case, a size of 10*100 integers is reserved during program initialization. The scope operator :: has been adopted in the CDecl language to state the scope of the declared variable.
int foo::B[10][100];

Scilab function declarations required an adaptation of the C declarative syntax to handle the case of multiple output parameters. The CDecl language has two special specifiers, in and out, for declaring whether a function parameter is an input or an output parameter. Their usage is shown in the example below:

int foo1(in int **gfa, in int **fb, out int **k);

This declaration stands for the Scilab function

function k = foo1(gfa, fb)

It is important to notice that formal function parameter variables inherit the size of the actual function parameter variables during a function call. For this reason, formal function parameters have their size unspecified (they are declared as double pointers) while the type declaration of the elements is mandatory.

Moreover, an important point for end users is the policy regarding the support of Scilab intrinsic functions. Scilab uses two forms of intrinsic functions: (a) "fundamental" ones written in the C language and (b) "derived" ones written in Scilab and accessible from the single input specification file.

Finally, the Scilab Front-End tools (SAFE) assist the subsequent automatic coarse- and fine-grain parallelism exploitation engines by transferring user information regarding task identification in the form of Scilab comments.

3. ALMA toolset overview

The ALMA toolset provides an end-to-end toolchain from Scilab [5] code to executable code on embedded MPSoC platforms. The typical use case involves an end user who develops and provides an application in an ALMA-specific Scilab dialect, as well as an abstract description of the target architecture using the ALMA Architecture Description Language (ADL), described in Section 4. The ALMA-specific Scilab dialect is a subset of the Scilab language, enhanced with comment-type annotations, and is outlined in this paper in Section 2.
In the above use case, the ALMA toolset will produce parallelized executable code ready to run on the designated multicore embedded platform.

The ALMA toolset workflow is presented in Fig. 2. The ALMA-specific Scilab dialect source code is consumed by the Scilab Front-End (SAFE), which produces a C representation of the original code. Next, the C code is loaded into the GeCoS open source compiler framework [6] and is converted to the ALMA-specific Intermediate Representation (IR). The ALMA-specific IR is a GeCoS IR, extended to meet the needs of the ALMA project. Several transformations, implemented as GeCoS passes, are applied to the ALMA-specific IR before platform-independent MPSoC code is produced. The fine-grain parallelism extraction step, described in Section 7, targets the exploitation of the Single Instruction, Multiple Data (SIMD) instruction set of the underlying MPSoC architectures, addressing the data type selection and memory-access-aware vectorization problems. The coarse-grain parallelism extraction and optimization step, described in Section 6, analyzes and modifies the Control and Data Flow Graph (CDFG) in order to cluster, partition and schedule subgraphs to the available cores, taking into account the temporal and spatial constraints imposed by the architecture, the computational load, and the memory transactions of the various tasks. The parallelism extraction steps rely on the platform ADL description, which is available to them through the ADL Compiler. In addition, the ALMA Multi-Core Simulator, which is an abstraction of the platform-specific simulators, assists the code optimization steps by providing more accurate performance estimations.

Table 1. Overview of supported Scilab language constructs.

Statements: assignment, function definition, return, if, while, select, for, break
Expressions: expression primitives (integer, float, decimal, string literal, logical value), identifier, function call, matrix, parenthesized expression, negative expression, positive expression, !, operators within expressions (:, ^, .^, ', +, -, *, /, \, .*, .*., *.[0-9], ./, ./., /.[0-9], .\, .\., \.[0-9], &, &&, |, ||, ==, ~=, @=, <>, <, >, <=, >=)
Declarations: –

The diagram in Fig. 1 shows the ALMA approach from the target MPSoC perspective. The figure distinguishes between hardware and software. On the bottom, embedded MPSoC architectures are implemented, such as platforms based on Recore's reconfigurable DSP cores or Kahrisma [7–9] cores. Fig. 1 shows how the ALMA approach from Fig. 2 is integrated with the multicore hardware/simulator: the output of the ALMA tools is C-based code with parallel descriptions. This C-based code is taken as input by the target multicore platform's hardware-specific compilers (e.g. the Recore Xentium compiler, the Kahrisma compiler, etc.). The executable binaries created by the hardware-specific compilers can be run in the multicore simulators or can be directly executed on the multicore hardware.

An abstract ADL description of the multicore hardware architecture is used as an input for the ALMA approach. The ADL serves two goals:

1. The ADL defines an abstract hardware description of the multicore hardware target. This abstract information is used to build a multicore simulation environment for the multicore hardware target.
2. It defines additional characteristics of the multicore hardware that are used during the optimization steps of the ALMA tools.

4. ALMA architecture description language

The ALMA Architecture Description Language (ADL) is a fundamental component of the ALMA toolset and is used by all other components as a central database to gather information about the current target architecture.
The ADL is a key component to enable the target independence of the overall ALMA tool flow. Within the project, the architecture independence is shown by targeting two different architectures. Beyond that, the ADL-based approach enables the extensibility of the ALMA toolset to other target architectures as well as parametric design-space exploration of the target templates. The ADL serves as the hardware description input for the ALMA approach and therefore provides the following features:

- Abstract hierarchical structural description for simulation of multi-core architectures.
- Behavioral annotations to the structural description for compiler-oriented application mapping to multi-core target architectures.
- Microarchitecture, resource and instruction set description for performance estimation, SIMD instruction selection and platform-specific C code generation.
- Configuration description for supporting reconfigurable architectures.
- Extensibility by using a special markup language.
- Compact description of regular structures using loop and conditional constructs.
- Parameterizable description by using variables.

While several ADLs for MPSoCs exist [10–12], none is suitable to fulfill the special requirements of the ALMA toolset, including structural specification for simulation and behavioral information for compilation. Therefore, we developed a novel ADL that is tailored to the special needs of the ALMA project and the ALMA tools described within the following sections.

4.1. ADL data description

The ADL is based on a special markup language for coding hierarchically structured data in a text document. It is comparable to XML [13] and JSON [14] and creates a tree representation of the described data. The language uses scalar data types as leaf nodes and vector or object containers as inner nodes.

Fig. 1. ALMA toolset from an end user perspective.

While elements in a vector container are referenced with numbers, the elements inside an object can be referenced with a string key. Furthermore, the data description language offers the flexibility to use variables as well as constant mathematical expressions, for-loops and conditional constructs. This makes it possible to describe regular MPSoC structures in a very abstract way. After variable propagation, mathematical expression calculation, and "for"/"if" statement interpretation, the format can be converted to an XML or JSON representation and is thus further reusable.

4.2. ADL architecture description

Based on the markup language, the structure of the ADL description is specified. The ADL is structured in various major sections that allow the specification of the ALMA target architectures from a structural perspective annotated with behavioral information. Thereby, we rely on the concept of modules, instances, and connections as widely used by hardware description languages such as VHDL or SystemVerilog, but without describing the individual modules and connections at bit- or Register Transfer Level (RTL) granularity. Instead, the modules and connections are only specified in an abstract fashion in order to enable an analyzability that would be nearly impossible at a lower level of abstraction.

In detail, the ADL comprises the following top sections:

Global is used for global architecture definitions such as the base frequency. In addition, a boot configuration can be defined for reconfigurable architectures.

Interfaces is a library of usable connection types. An interface connects two or more modules with predefined ports and can provide behavioral information about connection type, transmission constraints, throughput, and other connection details.

Modules is a library of available system parts, describing their behavior and functionality. Modules can be instantiated and can be connected by ports using interfaces.
A single module consists of a port definition, simulation information and one or more behavioral annotations that define different module properties. Additionally, a module can hierarchically instantiate other modules that are implemented as submodules.

TopLevel is a special base module of each system description. In this part of the system description the top-level modules are instantiated and connected via interfaces.

Configurations allows expressing reconfigurable architectures that can change the functionality of a single module or a group of modules. A configuration consists of the required modules and their connections as well as the functionality of the grouped modules.

Microarchitectures specifies information about one or more processor architectures. A microarchitecture is referenced within the Core behavioral type annotated to modules or configurations.

4.3. Behavioral annotation

The structural specification of the target architecture within the ADL is annotated with behavioral information. A behavioral annotation can be applied to a Module, Configuration or Interface (see Table 2, column three). A behavioral annotation can consist of one or more behavioral types. A behavioral type categorizes a module as, e.g., memory, cache, network router or core (processing element). Each type is described by a set of different properties; e.g. a Memory type would include the size and delay properties. An overview of the possible behavioral types and their supported properties is given in Table 2.

The behavioral annotations do not represent an exact specification of the system's behavior. They rather provide an approximate description for optimizing application mapping to, as well as performance estimation of, the target architecture. The behavioral annotations are all well-defined in order to enable their analyzability within the ADL Compiler. The structural description as well as non-structural but behavioral information is extracted by the ADL Compiler.
To enable accurate simulations, additional simulation parameters are available.

4.4. Microarchitecture description

A microarchitecture description is a special behavioral annotation, as specified in Section 4.3, that describes a processor core. This description contains the available data types (i.e. the register formats), the resources within the processor pipeline, the instruction set, and some compiler-specific information.

Fig. 2. ALMA toolset overview from a technical perspective.
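To give a flavor of what such a description contains, the fragment below renders a hypothetical microarchitecture entry in the JSON form that the ADL markup can be converted to (Section 4.1). All keys and values here are invented for illustration; the paper does not publish the concrete ADL syntax, only the section names (e.g. Microarchitectures) and the kinds of information stored.

```json
{
  "Microarchitectures": {
    "example_dsp_core": {
      "dataTypes": ["int8", "int16", "int32", "float32"],
      "resources": { "alu": 2, "mul": 1, "loadStore": 1 },
      "instructions": {
        "add.i32": { "latency": 1, "resources": ["alu"] },
        "simd.add.i16x2": { "latency": 1, "resources": ["alu"] }
      },
      "compiler": { "issueWidth": 2, "registers": { "count": 32, "width": 32 } }
    }
  }
}
```

Such an entry would give the fine-grain parallelism extraction the register formats and SIMD instruction latencies it needs for performance estimation and instruction selection, without descending to RTL detail.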