Utilizing Custom Registers in Application-specific Instruction Set Processors for Register Spills Elimination

Hai Lin
Dept. of Electrical & Computer Engineering, University of Connecticut, Storrs, CT 06269, USA

Yunsi Fei
Dept. of Electrical & Computer Engineering, University of Connecticut, Storrs, CT 06269, USA

ABSTRACT

The application-specific instruction set processor (ASIP) has become an important design choice for embedded systems. It can achieve both the high flexibility offered by the base processor core and the high performance and energy efficiency offered by the dedicated hardware extensions. Although much effort has been devoted to computation acceleration, e.g., automatic custom instruction identification and synthesis, the limited on-chip data storage elements, including the register file and data cache, have become a potential performance bottleneck. In this paper, we propose a hardware/software cooperative approach and a linear scan register allocation algorithm to utilize the existing custom registers in ASIPs for eliminating register spills. The data traffic between the processor and memory can be reduced through efficient on-chip communications between the base processor core and custom hardware extensions. Our experimental results demonstrate that a promising performance gain can be achieved, which is orthogonal to improvements by any other technique in ASIP design.

Categories and Subject Descriptors
C.1.m [Processor Architectures]: Miscellaneous

General Terms
Design

Keywords
ASIP, register spill, custom register

Acknowledgments: This work was supported by an NSF grant CCF

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GLSVLSI'07, March 11-13, 2007, Stresa-Lago Maggiore, Italy.
Copyright 2007 ACM /07/0003 $5.00.

1. INTRODUCTION

In recent decades, application-specific instruction set processors (ASIPs) have been increasingly used in embedded system design to satisfy demanding requirements on performance, power consumption, cost, and turn-around time. ASIPs allow designers to customize both the instruction set architecture (ISA) and the underlying microarchitecture for a specific application domain. A programmable base processor core can be extended to incorporate dedicated hardware accelerators for applications; thus, both high flexibility and application-oriented high performance and energy efficiency can be achieved in ASIPs [18]. Several commercial tools have emerged that take various configurable and extensible processors from specification to hardware implementation, such as Tensilica Xtensa [1], ARCtangent Processor [2], Jazz DSP [3], Altera Nios/NiosII [4], and Xilinx MicroBlaze [5].

A crucial step to achieve high performance in ASIP design is to select an optimal set of custom instructions with the best speedup under certain architectural constraints, e.g., area constraints, the critical path time constraint, the limited number of input and output operands, etc. Three techniques, Very Long Instruction Word (VLIW), vector operations, and fused operations, have been exploited for possible custom instructions to achieve the best tradeoff between performance improvement and hardware cost [15].

Various techniques have been presented to explore the design space of custom instructions efficiently and thoroughly. Sun et al. used a priority function to compute rankings of the candidate instructions, and employed a branch-and-bound algorithm to prune inferior candidates [22]. To speed up the exploration process, Clark et al. added a configurable array of functional units to the baseline processor, which enables acceleration of a wide range of applications [9]. A set of algorithms for pattern
generation, pattern selection, and application mapping are also employed for reconfigurable systems [11, 17]. Several exact exhaustive algorithms (e.g., a binary tree to determine whether or not to include an operation node in a candidate instruction) and approximate algorithms (e.g., a genetic algorithm) have been summarized in [20]. The ASIP architecture and compiler co-exploration problem is addressed in [13].

In a simple RISC-style processor, operations are performed only on register data, and memory accesses are restricted to load and store instructions. In an ASIP implementation, the hardware extensions normally need to obtain or update data in the generic register file of the base core. Base instructions use at most two input operands and one output, which is determined by the number of read and write ports available on the register file. However, previous studies have shown that a significant part of the performance gain of custom instructions generally comes from clusters with more than two input operands, e.g., 4 to 5 inputs give the best results [24]. To reconcile the data bandwidth mismatch between the base processor core and potential custom extensions, local storage elements, namely designer-defined custom registers, can be generated to hold the extra input operands needed by custom instructions [14, 22]. Each additional input has to be loaded from the register file to a custom register explicitly by a move instruction. The data traffic between the base processor core and custom extensions can significantly offset the performance gain from selecting a complex cluster.

A shadow register technique has been presented to mitigate the data bandwidth limitation in the configurable processor [10, 12]. It makes writes to custom registers coincide with the write-back pipeline stage of base instructions, so that the cycle overhead of data transfer is removed. We observe that in this shadow register scheme, custom registers are local to the hardware extensions, where only the custom instructions
are allowed to use them for extra input operands. The reverse direction of usage, using the custom registers for base instructions, has not been exploited.

As indicated in [23], the choice of a sufficient number of registers has a significant impact on the code size, performance, and energy consumption of embedded processors. Since multiple constraints of embedded processors, such as area and power, limit the size of the register file, generic registers have become a scarce on-chip resource. For example, the ARM7TDMI has only 16 general registers, and 8 are used for the THUMB ISA [23]. In this paper, we propose a novel approach that turns the custom registers into a register file extension, so that the base instructions can use the data stored in custom registers instead of going to the memory system, possibly reducing the memory traffic, and hence execution time and energy consumption.

The remainder of the paper is organized as follows. Section 2 analyzes the register spill and memory traffic problem. Section 3 describes the proposed hardware/software cooperative approach and its implementation within an ASIP synthesis framework. Section 4 presents experimental results, followed by conclusions and future work in Section 5.

2. ANALYSIS OF MEMORY TRAFFIC PROBLEM

Figure 1 shows the partial datapath of a typical extensible processor. The left part illustrates the base architecture that we target in this paper, which is a simple single-issue 5-stage pipeline RISC processor core with a generic register file with two read ports and one write port. In the figure, we do not show the pipeline stages explicitly, and the tri-buffers refer to control signals. The right part depicts the custom hardware extensions, where custom logic is added as computation accelerators, and several custom registers are added to feed extra inputs to the custom logic. Previous studies have shown that relaxing the input operands constraint to approximately 4 can achieve performance gain close to the theoretical
limit [24]. Thus, we normally need two custom registers for extra inputs.

The shadow register technique makes the custom registers visible to the base instructions for writing, and uses the reserved field of an instruction for controlling (either skipping or forwarding values to) the shadow registers [10]. In the microarchitecture, as shown in Figure 1, when sg is set high for writing a result to the register file, sc can also be asserted to copy the result to a custom register. However, in this approach, only the data needed by custom instructions are forwarded to custom registers. If we mark the activation periods of a custom register during program execution, they tend to be sparsely scattered, and the custom register is idle for much of the time.

On the one hand, custom registers in hardware extensions are under-utilized during program execution. On the other hand, the generic registers in the base processor core are a scarce resource. When program variables have exhausted the register file, some register values have to be stored in the memory system temporarily (so-called register spills) so that the registers can be reclaimed for other variables. Later on, the spilled values can be loaded back to the register file.

[Figure 1: Partial datapath of a typical extensible processor]

These register spill instructions increase code size and the number of instructions executed. The traffic between the processor core and off-chip memory may also increase, degrading performance and increasing energy consumption.

We propose a novel approach to utilize custom registers as an extension of the generic register file. The custom registers hold data not only for custom instructions, but also for base instructions. Thus, a large number of register spills can be eliminated. We will evaluate the saving in dynamically executed instructions and the reduction in off-chip memory accesses obtained by applying an effective custom register binding algorithm.

3. HARDWARE AND SOFTWARE IMPLEMENTATIONS

In this section, we first discuss the software design to utilize custom registers for base instructions, then we formulate the problem of post-compilation custom register binding and propose a heuristic for it, and finally we describe the implementation within an ASIP synthesis framework.

3.1 Software Design to Utilize Custom Registers

Figure 2 gives an example code sequence with register spill instructions. We assume register allocation and assignment have been performed in the compilation process. Instruction I1 generates a variable which is bound to register r1. Suppose the register file is now fully occupied and r1 is selected for the next variable; I2 then has to spill the value in r1 to a memory address addr1 temporarily. Instruction I3 generates a new r1 value and I4 uses it. In order for I6 to use the old r1 value, Instruction I5 has to load the value from memory addr1 to r1 (or another register). Here instructions I2 and I5 form a pair of register spill instructions.

Figure 3 illustrates changes in the code sequence to utilize custom registers for base instructions. If Instruction I1 writes
the variable value not only to the destination register r1, but also to the shadow register Cr1 simultaneously, the value does not need to be stored in memory and Instruction I2 can be removed. Instructions I3 and I4 are not affected. To get the original value back to r1 for its use in Instruction I6, I5 just moves the value from Cr1 back to r1. For this code sequence, one pair of store and load instructions is saved; thus, two potential memory references are eliminated. A much more lightweight movecb instruction is added to transfer data from the custom register to a generic register.

    I1: r1 = ...;
    I2: store r1, addr1;    (spill instruction)
    I3: r1 = ...;
    I4: ... = use(r1, ...);
    I5: load r1, addr1;     (spill instruction)
    I6: ... = use(r1);

Figure 2: An example code sequence with register spill instructions

    I1: (r1, Cr1) = ...;
    I2: store r1, addr1;    (removed)
    I3: r1 = ...;
    I4: ... = use(r1, ...);
    I5: movecb r1, Cr1;
    I6: ... = use(r1);

Figure 3: An example code sequence with a register file extension

3.2 Algorithm for Binding Spill Variables to Custom Registers

In our approach, we turn custom registers into a register file extension, so that they are visible to base instructions for both reading and writing. With many to-spill variables generated during program execution and a limited number of custom registers, how to bind the spill variables to the custom registers to achieve the maximum saving in memory accesses remains a challenging problem.

Figure 4 illustrates the register binding problem. We assume that the instruction scheduling and register binding processes for the generic register file have been done by the base core compiler. There may be many register spills, defined by pairs of store and load instructions which refer to the same memory address, with the store instruction executing first. For those variables to be spilled, i.e., candidate variables for custom registers, we annotate the variable define time, DT, as the time when the variable value is assigned to a generic
register; and the expiration time, ET, as the time when this variable value is loaded back from memory to a register for reuse. The actual life-time intervals of all the spill variables and their original host generic registers can be obtained from profiling, as shown in Figure 4, where v_i (i = 1, ..., 6) represent spill variables, CR_i (i = 1, 2, 3) are the available custom registers, and R_i (i = 0, 1, ..., 31) are the generic registers.

Once a spill variable is selected for custom register binding, it is written to a certain custom register at the same time as it is written to a generic register, as shown in I1 in Figure 3. The custom register holds the variable for its whole life-time until it is needed by an instruction. At this time, one register spill has been eliminated, and the custom register is released for other spill variables. Note that two variables whose life-time intervals overlap cannot be assigned to the same custom register, i.e., they conflict with each other in terms of custom register binding. For example, v2 and v3 in the figure conflict. Our objective is to maximize the number of custom register bindings so as to eliminate as many register spills as possible. Figure 4 demonstrates one custom register binding scheme, which has a total of 6 bindings.

[Figure 4: Custom register binding problem]

3.2.1 Problem Formulation

Given a trace of a program with life-time intervals of all the spill variables, a conflict graph, G(V, E), can be derived to represent conflicts among the spill variables during program execution. Figure 5 shows an example conflict graph. Assume the program has a total of n spill variables (where n = 8 in the graph); the vertices v_i (i = 1, ..., n) represent spill variables, and an edge exists between v_i and v_j only if their life-time intervals overlap.

[Figure 5: An example conflict graph and the MISP problem]

The register binding problem can be reduced to a Maximal Independent Vertex Set
Problem (MISP). In the conflict graph G, an independent vertex set is a subset of the vertices such that no two vertices in the subset have an edge between them. A maximal independent set, as the term is used here, is an independent set containing the largest possible number of vertices (i.e., a maximum independent set). For m custom registers, finding m independent sets with the largest total number of vertices, where each set is allocated to one custom register, will achieve the maximum reduction in register spills. In Figure 5, the subset {v_3, v_4, v_5, v_8} composes a maximal independent set.

3.2.2 Linear Scan Register Allocation Algorithm

The MISP problem is known to be NP-complete [21], so we resort to heuristic algorithms to address it. We propose a novel register binding algorithm based on linear scan [19]. The pseudo-code of the algorithm is given in Algorithm 1. We next describe the algorithm and its implementation.

Assume we have n spill variables and m custom registers, and each variable has an associated life-time interval (DT, ET). As shown in Algorithm 1, we first sort the 2n endpoints of the n intervals in increasing order, and initialize m empty stacks, where each stack stands for one custom register (lines 1-2).

Algorithm 1 LinearScanRegisterAllocation
Input: life-time intervals of n variables ((DT_i, ET_i), i = 1, 2, ..., n), m custom registers (CR_1, CR_2, ..., CR_m)
Output: total number of register bindings regbinds
 1: sort all the define times and expiration times of the intervals in increasing order (t_i, i = 1, 2, ..., 2n);
 2: build m empty stacks;
 3: for each time point t_i do
 4:   if t_i = DT_j then
 5:     push variable j onto all the stacks S_1, S_2, ..., S_m;
 6:   end if
 7:   if t_i = ET_j then
 8:     search S_1, S_2, ..., S_m for variable j;
 9:     if there are stacks which contain variable j then
10:       get the set of stacks S_h1 to S_hk with a hit;
11:       randomly pick one stack S_hp;
12:       r[j] ← CR[hp];
13:       pop all the elements in the stack S_hp;
14:       remove variable j from the other k - 1 hit stacks;
15:       regbinds++;
16:     end if
17:   end if
18: end for
19: output the total number of register bindings regbinds;

We then scan the time points from the starting point. At each define time, we push the variable v_j (represented by the host generic register to which it is originally allocated) onto each stack as a candidate for custom registers (lines 4-6). At each expiration time, we search all the stacks for v_j, collect the hit stacks S_h1, ..., S_hk, and randomly choose one stack S_hp, which represents custom register CR_hp, for v_j (lines 7-12). All the candidate variables stored in stack S_hp are popped, because they all conflict with variable v_j. Meanwhile, because v_j has been allocated to CR_hp, it has to be removed from the other hit stacks to prevent duplicate allocation (lines 13-15).

The linear scan register allocation algorithm is very efficient. For a program, the computational complexity is n × m × N, where n is the total number of spill variables, m denotes the number of custom registers available, and N represents the number of generic registers (e.g., N = 32).

Figure 6 gives a snapshot of the register binding mechanism at time point t_3, which is the expiration time of variable v1. v1 is originally held in generic register R_2, and it is selected to be allocated to CR1 as well to eliminate one register spill. Stack 1 has to be emptied for other spill variables whose define times are later than t_3. Meanwhile, v1 should be removed from stacks 2 and 3.

With the spill variables selected for custom register binding, the changes described in Figure 3 can be easily applied to the associated spill instructions. The custom register binding issue can also be addressed before the code generation step, in which case the compiler has to be modified to consider a total of N + m registers for allocation.

Currently, our approach does not explicitly consider the intervals of custom instructions using the custom registers (see footnote 1). Our linear scan register allocation algorithm can be easily
tweaked to accommodate those intervals. In Figure 6, we can add m dummy custom registers along the Y axis, and sort their associated intervals of variables. At the define and expiration time of a spill variable, the algorithm has to first check whether the spill variable conflicts with those variables which are needed by custom instructions and have been assigned to the custom register. The spill variable will not be a candidate for the custom register if there is a conflict (i.e., it is not pushed onto the stack). Similar linear complexity will be achieved for the algorithm.

Footnote 1: Note that here the interval represents the real lifetime of a variable needed by a custom instruction.

3.3 Evaluation of the Approach

We evaluate the impact of our approach on dynamic program execution. We i
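
For readers who want to experiment with the binding heuristic of Algorithm 1 (Section 3.2.2), the following is a minimal Python sketch, not the authors' implementation. The function name `linear_scan_bind`, the dictionary-based input, and the 0-based register indices are illustrative assumptions. Two details deviate from or go beyond the pseudo-code: the lowest-numbered hit stack is chosen instead of a random one so that runs are reproducible, and when a define time and an expiration time coincide, the expiration is processed first (the paper does not specify tie handling).

```python
from typing import Dict, List, Tuple


def linear_scan_bind(intervals: Dict[str, Tuple[int, int]], m: int) -> Dict[str, int]:
    """Bind spill variables to custom registers in the style of Algorithm 1.

    intervals maps a variable name to (define_time, expiration_time).
    m is the number of custom registers.
    Returns {variable: custom register index (0-based)} for bound variables.
    """
    EXPIRE, DEFINE = 0, 1  # assumption: expirations sort before defines on ties

    # Line 1: sort all 2n time points in increasing order.
    events = []
    for var, (dt, et) in intervals.items():
        events.append((dt, DEFINE, var))
        events.append((et, EXPIRE, var))
    events.sort()

    # Line 2: one stack of candidate variables per custom register.
    stacks: List[List[str]] = [[] for _ in range(m)]
    binding: Dict[str, int] = {}

    for _, kind, var in events:
        if kind == DEFINE:
            # Lines 4-6: the variable becomes a candidate for every register.
            for stack in stacks:
                stack.append(var)
        else:
            # Lines 7-10: find the stacks that still hold the variable.
            hits = [i for i, stack in enumerate(stacks) if var in stack]
            if hits:
                # Line 11 picks randomly; this sketch takes the first hit.
                chosen = hits[0]
                binding[var] = chosen        # line 12
                stacks[chosen].clear()       # line 13: all others conflict
                for i in hits[1:]:           # line 14: avoid double binding
                    stacks[i].remove(var)
    return binding


if __name__ == "__main__":
    example = {"v1": (0, 3), "v2": (1, 5), "v3": (4, 8)}
    print(linear_scan_bind(example, 1))
    print(linear_scan_bind(example, 2))
```

In the example, v2 overlaps both v1 and v3, so with a single custom register only v1 and v3 can be bound, while a second register lets all three spills be eliminated. The value of regbinds in Algorithm 1 corresponds to the size of the returned dictionary.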