A lightweight speculative and predicative scheme for hardware execution

978-1-4673-2921-7/12/$31.00 © 2012 IEEE

A LIGHTWEIGHT SPECULATIVE AND PREDICATIVE SCHEME FOR HARDWARE EXECUTION

Razvan Nane, Vlad-Mihai Sima and Koen Bertels
Computer Engineering Lab, Delft University of Technology
email: {r.nane,v.m.sima,k.l.m.bertels}@tudelft.nl

ABSTRACT

If-conversion is a known software technique to speed up applications that contain conditional expressions and target processors with predication support. However, the success of this scheme is highly dependent on the structure of the if-statements, i.e., whether they are balanced or unbalanced, as well as on the path taken. Therefore, the predication scheme does not always provide a better execution time than the conventional jump scheme. In this paper, we present an algorithm that leverages the benefits of both jump and predication schemes adapted for hardware execution. The results show that performance degradation is no longer possible for unbalanced if-statements, and that all test cases achieve a speedup between 4% and 21%.

I. INTRODUCTION

As increases in the frequency of general purpose processors are becoming smaller and harder to obtain, new ways of providing performance are being investigated. One of the promising possibilities to improve system performance is to generate dedicated hardware for the computation intensive parts of applications. As writing hardware involves a huge effort and needs special expertise, compilers that translate directly from high level languages to hardware languages have to be available before this method is widely adopted. As C and VHDL are the most widely used languages in embedded and hardware system development, respectively, we will focus on C-to-VHDL compilers. The algorithm presented here can, in theory, be applied to any such compiler.
A C-to-VHDL compiler can share a significant part with a compiler targeting a general purpose architecture. Still, there are areas for which the techniques must be adapted to take advantage of all the possibilities offered.

In this context, this paper presents an improved predication algorithm, which takes into account the characteristics of a C-to-VHDL compiler and the features available on the target platform. Instruction predication is an already known compiler optimization technique; however, current C-to-VHDL compilers do not take full advantage of the possibilities offered by this optimisation. More specifically, we propose a method to increase performance in the case of unbalanced if-then-else branches. These types of branches are problematic because, when the jump instructions are removed for predicated execution and the shorter branch is taken, slowdowns occur because (useless) instructions from the longer branch still need to be executed. Based on both synthetic and real world applications, we show that our algorithm does not substantially increase resource usage, while the execution time is reduced in all the cases to which it is applied.

The paper is organized as follows. We begin by presenting a description of the predication technique and previous research, emphasizing the missed optimization possibilities. In Section III we present our algorithm and describe its implementation. The algorithm is based on a lightweight form of speculation because it does not generate logic to roll back speculated values. It employs a lightweight form of predication because only some branch instructions are predicated, while jump instructions are kept. Section IV discusses the results and Section V concludes the paper.

II. RELATED WORK AND BACKGROUND

Given the code in Fig. 1 (a), the straightforward way of generating assembly (or low level code) is presented in Fig. 1 (b). We note that for either of the two branches there is at least one jump that needs to be taken.
If the block execution frequency is known, an alternative approach exists in which the two jumps are executed only on the least taken branch.

Branches are a major source of slowdowns in pipelined processors, as the pipeline needs to be flushed before continuing. Furthermore, branches are also scheduling barriers, cause I-cache refills and limit compiler scalar optimizations. In order to avoid this negative effect, the concept of predication was introduced, which does not alter the flow but executes (or not) an instruction based on the value of a predicate. An example is given in Fig. 1 (c). In this scheme no branches are introduced, but on a single issue processor more (useless) instructions are executed. On a multiple issue processor such instructions can be "hidden" because the two code paths can be executed in parallel. We emphasize that the advantage of predication comes from the fact that there are no branches in the code.

(a)
    if (x)
        r = a + b;
    else
        r = c - d;

(b)
    cond = cmp x,0
    branchf cond, else
    add r,a,b
    branch end
    else:
    sub r,c,d
    end:

(c)
    cond = cmp x,0
    [cond] add r,a,b
    [!cond] sub r,c,d

Fig. 1. (a) C-Code; (b) Jump-; (c) Predicated-Scheme.

The predication scheme assumes that the penalty of the jump is huge and thus that branching has to be avoided. This is no longer true in the case of VHDL code. In VHDL code there are no "instructions" but states in a datapath, controlled by a Finite State Machine (FSM). A straightforward implementation in VHDL of the jump scheme is presented in Fig. 2. Later sections discuss the implications of the fact that jumps do not introduce a huge delay.
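The difference between the jump scheme of Fig. 1 (b) and the predicated scheme of Fig. 1 (c) can be sketched at source level in C. This is only an illustration: the function names `select_jump`, `select_pred` and `check_schemes_agree` are hypothetical and do not come from the paper, and a C compiler is free to lower either function to branches or to conditional moves.

```c
#include <assert.h>

/* Jump scheme, in the spirit of Fig. 1 (b): control flow decides which
 * single operation is executed. (Hypothetical helper name.) */
static int select_jump(int x, int a, int b, int c, int d)
{
    int r;
    if (x)
        r = a + b;
    else
        r = c - d;
    return r;
}

/* Predicated scheme, in the spirit of Fig. 1 (c): both operations are
 * evaluated and the predicate only selects which value is kept.
 * (Hypothetical helper name.) */
static int select_pred(int x, int a, int b, int c, int d)
{
    int cond = (x != 0);      /* cond = cmp x,0       */
    int then_val = a + b;     /* [cond]  add r,a,b    */
    int else_val = c - d;     /* [!cond] sub r,c,d    */
    return cond ? then_val : else_val;
}

/* Both schemes must produce the same value for either predicate. */
static void check_schemes_agree(void)
{
    assert(select_jump(1, 2, 3, 10, 4) == select_pred(1, 2, 3, 10, 4));
    assert(select_jump(0, 2, 3, 10, 4) == select_pred(0, 2, 3, 10, 4));
}
```

Calling `check_schemes_agree()` from a `main` function exercises both paths. The sketch shows why predication removes the control dependency at the price of always computing both values, which is the trade-off the text describes.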
For the example of Fig. 2, applying predication decreases the number of states from 4 to 2. We will show in later sections how our algorithm can reduce the number of states even for unbalanced branches, a case not treated in previous work.

datapath
    state_1:
        cond = cmp x,0
    state_2:
    state_3:
        r=a+b;
    state_4:
        r=a-b;
    state_5:
        .... -- code after if-statement

FSM
    state_1:
        next_state = state_2
    state_2:
        if(cond)
            next_state = state_3
        else
            next_state = state_4
    state_3:
        next_state = state_5
    state_4:
        next_state = state_5
    state_5:
        ....

Fig. 2. Jump Scheme

A seminal paper on predication is [3], where a generic algorithm is presented that works on hyperblocks, which extend the concept of basic blocks to a set of basic blocks that execute or not based on a set of conditions. It proposes several heuristics to select the sets of basic blocks, as well as several optimizations on the resulting hyperblocks, and discusses whether generic optimizations can be adapted to the hyperblock concept. Compared to our work, their heuristic does not consider the possibility of splitting a basic block and does not analyse the implications for a reconfigurable architecture, e.g. that branching in hardware incurs no penalty.

    void balanced_case(int *a, int *b, int *c, int *d,
                       int *e, int *f, int *result)
    {
        if (*a > *b)
            *result = *c + *d;
        else
            *result = *e - *f;
    }

Fig. 3. Balanced if branches.

The work in [4] proposes a dynamic programming technique to select the fastest implementation for if-then-else statements. As with the previous approach, any change in the control flow is considered to add a significant performance penalty. In [5], the authors extend the predication work in a generic way to support different processor architectures. In this work, some instructions are moved from the predicated basic blocks to the delay slots, but as delay slots are very limited in nature, no extensive analysis is performed about this decision.

Regarding the C-to-VHDL compilers, we mention Altium's C to Hardware (CHC) [6] and LegUp [7].
They translate functions that belong to the application's computationally intensive parts in a hardware/software co-design environment. Neither of these compilers specifically considers predication coupled with speculation during the generation of VHDL code.

III. SPECULATIVE AND PREDICATIVE ALGORITHM

In this section, we describe the optimization algorithm based on two simple but representative examples, which illustrate the benefit of including the Speculative and Predicative Algorithm (SaPA) as a default transformation in High Level Synthesis (HLS) tools.

III-A. Motivational Examples

To understand the problems of the predication scheme (PRED) compared to the jump scheme (JMP), we use two functions, each containing one if-statement. The first, shown in Fig. 3, considers the case in which the then- and else-branches are balanced, i.e. they finish executing the instructions on their path in the same number of cycles, whereas the second deals with the unbalanced scenario (Fig. 4). In these examples, we assume the target platform is the Molen machine organisation [2] implemented on a Xilinx Virtex-5 board. This setup assumes that three cycles are used to access memory operands, that simple arithmetic (e.g. addition) and memory write operations take one cycle, and that the division operation accounts for eight cycles.

The FSM states corresponding to the two examples are listed in Fig. 5(a) and 5(b).
(1) JMP_B
    S1: ld *a
    S2: ld *b
    S4: read a;
    S5: read b;
    S6: TB = cmp_gt (a,b)
    S7: if (TB) jmp S16;
    S8: ld *e;
    S9: ld *f;
    S11: read e;
    S12: read f;
    S13: result = e-f;
    S14: write result;
    S15: jmp S23;
    S16: ld *c;
    S17: ld *d;
    S19: read c;
    S20: read d;
    S21: result = c+d;
    S22: write result;
    S23: return;

(2) PRED_B
    S1: ld *a
    S2: ld *b
    S4: read a;
    S5: read b;
    S6: TB = cmp_gt (a,b)
    S7: TB ? ld *c : ld *e;
    S8: TB ? ld *d : ld *f;
    S10: TB ? (read) c : e;
    S11: TB ? (read) d : f;
    S12: if (TB) result = c+d; else result = e-f;
    S13: write result;
    S14: return;

(3) SaPA_B
    S1: ld *a
    S2: ld *b
    S3: ld *c
    S4: read a; ld *d;
    S5: read b; ld *e;
    S6: read c; ld *f; TB = cmp_gt (a,b)
    S7: read d;
    S8: read e;
    S9: read f;
    S10: if (TB) result = c+d; else result = e-f;
    S11: write result;
    S12: return;

(a) Balanced Example

(4) JMP_U
    S1: ld *a
    S2: ld *b
    S4: read a;
    S5: read b;
    S6: TB = cmp_gt (a,b)
    S7: if (TB) jmp S16;
    S8: ld *e;
    S9: ld *f;
    S11: read e;
    S12: read f;
    S13: result = e-f;
    S14: write result;
    S15: jmp S32;
    S16: ld *c;
    S17: ld *d;
    S19: read c;
    S20: read d;
    S21: tmp = c+d;
    S22: init tmp/5;
    S30: result = tmp/5;
    S31: write result;
    S32: return;

(5) PRED_U
    S1: ld *a
    S2: ld *b
    S4: read a;
    S5: read b;
    S6: TB = cmp_gt (a,b)
    S7: TB ? ld *c : ld *e;
    S8: TB ? ld *d : ld *f;
    S10: TB ? (read) c : e;
    S11: TB ? (read) d : f;
    S12: tmp = c+d; if (!TB) result = e-f;
    S13: init tmp/5;
    S21: if (TB) result = tmp/5;
    S22: write result;
    S23: return;

(6) SaPA_U
    S1: ld *a
    S2: ld *b
    S3: ld *c
    S4: read a; ld *d;
    S5: read b; ld *e;
    S6: read c; ld *f; TB = cmp_gt (a,b)
    S7: read d;
    S8: read e;
    S9: read f;
    S10: if (TB) tmp = c+d; else { result = e-f; jmp S20; }
    S11: init tmp/5;
    S19: result = tmp/5;
    S20: write result;
    S21: return;

(b) Unbalanced Example

Fig. 5. Synthetic Case Studies.

    void unbalanced_case(int *a, int *b, int *c, int *d,
                         int *e, int *f, int *result)
    {
        int tmp;
        if (*a > *b) {
            tmp = *c + *d;
            *result = tmp / 5;
        } else
            *result = *e - *f;
    }

Fig. 4. Unbalanced if branches

For each example, the first column represents the traditional jump scheme ((1) and (4)), the middle column ((2) and (5)) represents the predicated one, and the last column ((3) and (6)) shows the SaPA version. The SaPA columns will be explained in more detail in the next section, as they present the solution to the problem described here. Because each state executes in one cycle, the first five states are needed to load the a and b parameters from memory. In the first two states, the addresses of the parameters are written on the memory address bus. State three is an empty state and therefore is not shown in the figures. Finally, in states four and five the values of the parameters are read from the data bus. These operations are common to all possible implementations (i.e. to all combinations of balanced/unbalanced case study and JMP/PRED/SaPA schemes) shown by the columns numbered (1) to (6) in the two figures. Subsequently, the then-branch (TB) predicate is evaluated for the JMP cases (columns (1) and (4)). Based on this value, a jump can be made to the then-branch states (states 16 to 22), or, in case the condition is false, execution falls through to the else-path (states 8 to 15). The number of states required for the unbalanced case, i.e. (4) JMP_U, is larger due to the additional division operation present in the then-branch. That is, in state 22 we initialize the division core with the required computation, whereas in state 30 we read the output.

Applying the predication scheme to the balanced example results in a reduction in the number of states. This is achieved by merging the then- and else-branches and by selecting the result of the correct computation based on the predicate value. This optimization is ideal for HLS tools because decreasing the number of states reduces the total area required to implement the function. For the examples used in this section, a reduction of nine states was possible, i.e. when comparing (1) and (4) with (2) and (5) respectively.
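The state walk described above can be reproduced with a small C sketch that steps through each FSM of Fig. 5 and counts one cycle per state. Everything here is an assumption made for illustration: the `struct fsm` encoding, the names `FSMS` and `cycles`, the choice to count empty states and the final return state as one cycle each, and the jump targets as read off the state listings. The resulting counts are model values only; they are not the measured execution times and need not match the paper's own cycle figures.

```c
/* One FSM jump: when leaving state `from`, continue at state `to` if the
 * then-branch predicate TB equals `when_tb` (-1 means unconditional). */
struct jump { int from; int to; int when_tb; };

/* Hypothetical encoding of one column of Fig. 5. */
struct fsm {
    const char *name;
    int last_state;          /* number of the final (return) state */
    int njumps;
    struct jump jumps[2];
};

/* Jump targets as read from the state listings (assumed, see lead-in). */
static const struct fsm FSMS[6] = {
    { "JMP_B",  23, 2, { { 7, 16,  1 }, { 15, 23, -1 } } },
    { "PRED_B", 14, 0, { { 0,  0,  0 }, {  0,  0,  0 } } },
    { "SaPA_B", 12, 0, { { 0,  0,  0 }, {  0,  0,  0 } } },
    { "JMP_U",  32, 2, { { 7, 16,  1 }, { 15, 32, -1 } } },
    { "PRED_U", 23, 0, { { 0,  0,  0 }, {  0,  0,  0 } } },
    { "SaPA_U", 21, 1, { { 10, 20, 0 }, {  0,  0,  0 } } },
};

/* Count visited states from state 1 to the last state, one cycle each.
 * tb is 1 when the then-branch is taken and 0 for the else-branch. */
static int cycles(const struct fsm *f, int tb)
{
    int state = 1;
    int count = 0;
    while (state <= f->last_state) {
        int next = state + 1;            /* default: fall through */
        count++;
        for (int i = 0; i < f->njumps; i++) {
            const struct jump *j = &f->jumps[i];
            if (j->from == state && (j->when_tb < 0 || j->when_tb == tb))
                next = j->to;
        }
        state = next;
    }
    return count;
}
```

Under these assumptions, `cycles(&FSMS[3], 0)`, `cycles(&FSMS[4], 0)` and `cycles(&FSMS[5], 0)` give 16, 23 and 12 cycles for the else-path of the unbalanced example, so the model shows the predicated column taking longer than the jump column on the shorter branch, while the SaPA column is the shortest of the three.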
However, because branches can be unbalanced, merging them can have a negative impact on performance when the shorter one is taken. For example, in column (5) PRED_U, when the else-path is taken, states 13 to 21 are superfluous and introduce a slowdown for the overall function execution.

Fig. 6 shows all possible paths for both examples as well as their execution times in number of cycles, e.g. from state 1 to state 23. TE represents the execution Time for the Else path, while TT is the Time when the Then path is taken. The upper part of the figure corresponds to the balanced if-function and the lower part to the unbalanced case. Furthermore, there is a one-to-one correspondence between the columns in Fig. 5 and the scenarios in Fig. 6. The numbers on the edges represent the number of cycles [...]

Fig. 8. Data Dependency Graphs. (a) Predicated Graph; (b) SaPA Graph.

[...] removed here. That is, for simple expressions with local scope, dependency edges coming from the if-predicate are not needed. These expressions can be evaluated as soon as their input data is available. We name this lightweight speculation because, by removing the control flow dependencies from the if-predicate, we enable speculation; however, we do not introduce any code to perform a roll back in case the wrong branch was taken, as this is not necessary in our hardware design. The dependencies to the memory writes, however, remain untouched to ensure correct execution. Fig. 8(b) exemplifies the removal of unnecessary dependency edges for the balanced case study. It also shows, in the dashed box, the (optional) insertion of the jump instruction, which was not necessary in the illustrated case.

Finally, the scheduler is executed, which can now schedule the local operations before the if-predicate is evaluated, i.e. speculatively.
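The effect of removing the predicate dependencies can be sketched at source level for the balanced function of Fig. 3. This is a hand-written C analogue, not compiler output: `balanced_case_spec` is a hypothetical name, the sketch assumes all six input pointers are valid even for the branch that is not taken, and it assumes neither expression overflows. Only the memory write remains dependent on the predicate, which is why no roll back is needed.

```c
#include <assert.h>

/* Reference version, as in Fig. 3. */
static void balanced_case(int *a, int *b, int *c, int *d,
                          int *e, int *f, int *result)
{
    if (*a > *b)
        *result = *c + *d;
    else
        *result = *e - *f;
}

/* Speculated version (hypothetical name): operand loads and both
 * side-effect-free expressions no longer wait for the predicate.
 * Assumes every pointer may be dereferenced regardless of the branch. */
static void balanced_case_spec(int *a, int *b, int *c, int *d,
                               int *e, int *f, int *result)
{
    int va = *a, vb = *b;          /* operands of the predicate      */
    int vc = *c, vd = *d;          /* speculative loads, then-branch */
    int ve = *e, vf = *f;          /* speculative loads, else-branch */

    int then_val = vc + vd;        /* computed before TB is known    */
    int else_val = ve - vf;        /* computed before TB is known    */
    int tb = va > vb;

    /* The write is the only operation still guarded by the predicate;
     * the unused speculated value is simply discarded. */
    *result = tb ? then_val : else_val;
}

/* Both versions must store the same result on either path. */
static void check_speculation_is_safe(void)
{
    int a = 5, b = 3, c = 10, d = 4, e = 9, f = 2, r1 = 0, r2 = 0;
    balanced_case(&a, &b, &c, &d, &e, &f, &r1);
    balanced_case_spec(&a, &b, &c, &d, &e, &f, &r2);
    assert(r1 == r2);
    a = 1;                         /* now the else-path is taken */
    balanced_case(&a, &b, &c, &d, &e, &f, &r1);
    balanced_case_spec(&a, &b, &c, &d, &e, &f, &r2);
    assert(r1 == r2);
}
```

Calling `check_speculation_is_safe()` from a `main` function compares the two versions on both paths. In software the extra loads and the discarded computation cost time; in a hardware datapath they can be scheduled in parallel, which is the property the speculation relies on.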
Furthermore, when the FSM is constructed, whenever the branches become unbalanced, a conditional jump instruction is scheduled to enforce the SaPA behaviour. If the predicate of the jump instruction is true, the FSM jumps to the end of the if-block, thereby avoiding wasted cycles when the shorter branch is taken. This ensures that no performance degradation is possible with this scheme. If the predicate is false, the FSM follows the default behaviour and moves to the next state.

IV. EXPERIMENTAL RESULTS

The environment used for the experiments is composed of three main parts: i) the C-to-VHDL DWARV 2.0 compiler [1], extended with the flow presented in Section III, ii) the Xilinx ISE 12.2 synthesis tools, and iii) the Xilinx Virtex5 ML510 development board. This board contains a Virtex 5 xc5vfx130t FPGA consisting of 20,480 slices and 2 PowerPC processors. To test the performance of the algorithm presented, we used seven functions. Two are the simple synthetic cases introduced in the previous section, while the other five were extracted from a library of real world applications. These applications contain both balanced and unbalanced if-branches.

Fig. 9 shows the speedups of the PRED and SaPA schemes compared to the JMP scheme. The balanced function shows how much speedup is gained by combining the predication scheme with speculation. Similarly, the speedup of the unbalanced function, tested with inputs that select the longer branch (LBT), shows a performance improvement compared to the JMP scheme due to speculation. However, when the shorter branch is taken (SBT), the PRED scheme suffers from performance degradation. Applying the SaPA scheme in this case allows the FSM to jump when the shorter branch finishes, thereby obtaining a 1.14x speedup.

The execution times of both schemes for the gcd function are the same because both paths found in this arithmetic function have length one, i.e. they perform only one subtraction each.
Therefore, the only benefit is derived from removing the jump instruction following the operand comparison. It is important to note that applying speculation is useful in all cases where both the branches and the if-predicate computation take more than one cycle. Otherwise, the PRED scheme is enough to obtain the maximum speedup. Nevertheless, the speedup that can be obtained by simply predicating the if-statement, and thus saving one jump instruction per iteration, can be considerable, e.g. 20% when the gcd input numbers are 12365400 and 906. mergesort provided a test case with a balanced if-structure where each of the paths contains more than one instruction. Therefore, the benefit of applying SaPA was greater than that of using the PRED scheme. This example confirms that whenever the paths are balanced, the application cannot be slowed down.

Finally, the last three cases show results obtained for unbalanced cases with inputs that trigger the shorter branches in these examples. As a result, for all these functions the PRED scheme generates hardware that performs worse than the simple JMP strategy. Applying the SaPA algorithm and introducing a jump instruction after the short branch allows the FSM to break out of the "predication mode" execution of the if-statement and continue with executing useful instructions.

To verify that the presented algorithm does not introduce a degradation of other design parameters, i.e. area and frequency, we synthesized all test cases using the environment described at the beginning of this section. Table I summa-