A Novel Coarse-Grain Reconfigurable Data-Path for Accelerating DSP Kernels

M.D. Galanis (1), G. Theodoridis (2), S. Tragoudas (3), D. Soudris (4), and C.E. Goutis (1)

(1) VLSI Design Lab., Elect. & Comp. Eng. Dept., Univ. of Patras, Rio 26110, Greece
(2) Physics Dept., Aristotle Univ. Thessaloniki 54124, Greece
(3) Elect. & Comp. Eng. Dept., Southern Illinois Univ., Carbondale 62901, USA
(4) VLSI Design & Testing Center, Elect. & Comp. Eng. Dept., Democritus Univ., Xanthi 67100, Greece

mgalanis@vlsi.ee.upatras.gr

This work was partially supported by the project IST-34793-AMDREL funded by the E.C.
ABSTRACT
In this paper, an efficient implementation of a high-performance coarse-grain reconfigurable data-path on a mixed-granularity reconfigurable platform is presented. It consists of several coarse-grain components of the same type, a reconfigurable inter-component network, and a centralized register bank. The universal type of coarse-grain component is shown to increase the system's performance through significant reductions in latency. A flexible interconnection network facilitates data transfers between the coarse-grain components as well as to and from the register bank. An automated methodology for mapping DSP and multimedia kernels on the data-path is also presented. Chaining of operations is optimally exploited, and the architecture allows for simple and efficient algorithms for scheduling, live-signal reduction, and component binding. Experimental results verify the impact of our architectural decisions and design automation methods.
1. INTRODUCTION
Present and future applications, such as interactive multimedia and wireless LAN applications, are characterized by high complexity and diverse functionality, while they demand high speed and low power consumption. Compared to pure software or hardware implementations, reconfigurable computing offers a plethora of advantages for implementing the computationally intensive kernels of such applications with high performance, low power consumption, and reduced design time [1]-[3]. Exploiting the special benefits of reconfigurable computing, such as the abundant spatial parallelism, high computational density, and the flexibility to change functionality according to application requirements, many reconfigurable architectures, together with methodologies for mapping computationally intensive kernels onto them, have been proposed [1]-[8]. However, reconfigurable systems suffer from a large reconfiguration overhead, that is, the extra time and energy required during the transition from one reconfiguration state to the next. To overcome this drawback, several techniques have been proposed, which can be grouped into two categories.

The first category comprises techniques that develop mixed-granularity reconfigurable architectures. Such a hybrid architecture, like the one in [8], consists of fine-grain reconfigurable units, usually implemented in FPGA technology, coarse-grain units implemented either in FPGA or ASIC technology, data-storage memory, and a reconfigurable interconnection network. The basic idea is to exploit the frequency of appearance of similar computational structures contained in the considered kernels when developing configurable coarse-grain units and mapping methodologies to implement these structures. Since the coarse-grain units reside permanently in the hardware, there is no need to configure the corresponding hardware portions, which reduces the reconfiguration overhead. If the coarse-grain units are implemented in ASIC technology, a further performance improvement is achieved.

The second category covers techniques that perform efficient temporal partitioning during the stage of mapping the application onto the reconfigurable hardware [21]. The basic concept is to properly partition the application into a number of segments, which are executed one after another, aiming at optimally exploiting the computational power offered by the hardware in each partition. Compared to static (one-time) reconfiguration, the number of reconfigurations usually increases. However, performance is improved since in each segment all the computational power offered by the hardware is fully utilized to implement the operations of that segment.

In this paper, a high-performance data-path for implementing computationally intensive DSP kernels is introduced. It consists of a set of coarse-grain components (CGCs) implemented in ASIC technology, a reconfigurable interconnection network, and a centralized register bank. The data-path is part of a general mixed-granularity reconfigurable platform, which includes a microprocessor, fine-grain reconfigurable elements, program and data memory, and the proposed data-path. A CGC consists of an n×m array of nodes and appropriate steering logic. Each node contains a multiplier and an ALU implemented as pure combinational circuits, while the steering logic allows any desired template to be easily realized, so the system's performance benefits from the chaining of operations. Flexible interconnection networks allow for direct communication among the CGCs and between the CGCs and the register bank. Also, a methodology for mapping DSP and multimedia kernels on the data-path is presented. Due to the universal and flexible structure of the CGC and the existence of the flexible interconnection networks, Data Flow Graphs (DFGs) are realized with reduced latency compared to the case where they are implemented with ASIC primitive resources. Considering that the Instruction Level Parallelism (ILP) of typical DFGs of DSP applications is typically a small constant [19], our architecture is implemented using a small number of reconfigurable CGCs; we demonstrate that this suffices to efficiently realize any such DFG with low latency. Our architectural decisions also lead to very simple, yet efficient, algorithms for scheduling, live-signal reduction, and CGC binding. The universality and flexibility of the CGC also support more efficient temporal partitioning.

The paper is organized as follows: Section 2 presents the related work, while the coarse-grain reconfigurable data-path architecture is introduced in Section 3. The data-path's architecture is studied and analyzed in Section 4, while the synthesis aspects raised by the proposed architecture are discussed in Section 5. In Section 6 the experimental results are presented. Finally, conclusions and future work are drawn in Section 7.
2. RELATED WORK
There has been considerable research on developing reconfigurable architectures in the past [3]. We focus on mixed-granularity platforms containing coarse-grain reconfigurable data-paths, whose target domain is mainly DSP, multimedia, and telecom applications. The most representative works are outlined below.

The processing elements of [4], [5], [18] have internal registers for storing intermediate variables, as opposed to the fully combinational nature of the CGC, which is more suitable for improving the system's performance through latency reduction. The Pleiades architecture [8] combines an on-chip microprocessor with a number of heterogeneous reconfigurable units of different granularities connected via a reconfigurable interconnection network. These units are mainly MACs, ALUs, and an embedded FPGA. Although remarkable results, especially in power consumption, have been reported [7], Pleiades suffers from a major drawback: no systematic methodology exists for mapping an arbitrary DSP application onto a predefined Pleiades-based architecture. Instead, a methodology for deriving the required architecture for a specific application is presented. In contrast, the introduced data-path and mapping methodology are characterized by greater generality and efficiency, since they allow either generating a data-path instance for a specific application or mapping an application onto a predefined data-path architecture.

The Strategically Programmable System (SPS) [6] is a reconfigurable system architecture that combines fine-grain reconfigurable units and ASIC coarse-grain modules, which are pre-placed within a fully reconfigurable fabric. These modules, called Versatile Parameterizable Blocks (VPBs), are generated by means of template generation [10] from a set of DFGs of the application domain. However, the reported generated templates are not flexible enough, and this results in a cover of the graph by the VPBs that is neither full nor optimal. Since the uncovered operations are implemented by the fine-grain units, the system's performance is reduced. The mixed-granularity approach has recently been adopted in commercial FPGA devices, like the Xilinx Virtex-II/Spartan-3 [16] and the Altera Stratix [17]. These devices contain ASIC multiplier units, which operate on 18-bit operands and can be considered as the coarse-grain hardware. However, additional operations, such as additions or other ALU operations, appear in the majority of DSP kernels. Since these operations are realized by the fine-grain circuitry, the performance is reduced compared to a full ASIC implementation, as in our case.

In High-Level Synthesis design automation, research activities such as [9], [11], [12] suggest that primitive resources must be substituted by more complex computational data-path resources. These complex resources, usually called templates or clusters, consist of primitive resources connected in sequence without intermediate registers between them, to take advantage of the chaining of operations. Chaining is the removal of intermediate registers between primitive hardware units; it can improve the combined delay of the chained primitive resources and hence reduce the critical path in a DFG. Such complex resources can either be obtained from a library of templates [9] or be automatically extracted from the Intermediate Representation (IR) of an application by a process called template generation [10], [12]. In [9] it is shown that template selection from the predefined library had the largest impact on the overall improvement in delay.
 
 
3. RECONFIGURABLE PLATFORM ARCHITECTURE DESCRIPTION
3.1. Overall design
The whole architecture of the considered platform is shown in Fig.1. It contains a microprocessor, fine-grain reconfigurable units, program and shared data memories, and the introduced coarse-grain reconfigurable data-path.
Fig.1: Reconfigurable platform architecture
The microprocessor executes the software parts of the application derived after the hardware/software partitioning stage. It usually executes the higher layers (e.g. the application layer) of a protocol application, or the control-oriented and non-computationally-intensive parts of a DSP application. The processor is also responsible for controlling and configuring the fine-grain reconfigurable units.

The fine-grain reconfigurable hardware's granularity ranges from 1 to 2 bits and it is realized by an embedded FPGA unit. The fine-grain reconfigurable hardware has a triple role. First, it executes small bit-width operations, for instance bit-level ones. A typical example is a scrambler unit, found in most telecommunication systems, where the operations are 1-bit XORs. For such operations a fine-grain architecture is preferable, since the granularity of the Configurable Logic Blocks (CLBs) is typically one or two bits. Second, it may execute exceptional operations, for instance division or square root, which are not efficiently realized by the coarse-grain units. Finally, it implements the control-unit of the coarse-grain reconfigurable data-path.

The coarse-grain portion of the architecture consists of identical CGCs, a centralized register bank, and a reconfigurable interconnection network, as shown in Fig. 2. To reduce the reconfiguration overhead and improve performance, the CGCs are implemented in ASIC technology. The reconfigurable coarse-grain part is responsible for executing the word-level operations of the computationally intensive parts of an application (e.g. DSP and multimedia kernels). The word-width is 16 bits, which is adequate for the majority of these kernels. Each CGC node contains a multiplier and an ALU; each time, one of the two operations is selected by appropriate control signals. The connections among the CGCs and between the register bank and the CGCs are achieved by means of a reconfigurable interconnection network, which is also configured by the control-unit.
Fig. 2: CGC-based reconfigurable data-path
The computational units of the platform exchange data through the shared data memory, as shown in Fig. 1. In a data-intensive application where blocks of data are processed (e.g. a video application), the results of the processing in the coarse-grain part are stored in the shared data memory (e.g. in a FIFO manner) and can then be used by the FPGA, which implements other parts of the application. For example, in a JPEG application the coarse-grain reconfigurable data-path executes the DCT on an 8x8 image block and the results are the input to the Huffman encoder, which would be implemented in the fine-grain hardware. To efficiently realize applications on this platform, the development of a methodology for mapping the computationally intensive parts onto the CGCs is necessary. The input specification explicitly determines which tasks will be executed on the microprocessor, the fine-grain hardware, and the CGC-based data-path.
3.2. CGC design
The structure of the proposed CGC is an n×m array of nodes, where n and m are the number of nodes per row and column, respectively. In Fig. 3 such a CGC (hereafter called a 2×2 CGC), with 2 nodes per row and 2 nodes per column, is illustrated; it will be used to demonstrate the features of the introduced CGC. A 2×2 CGC consists of 4 nodes, whose interconnection is shown in Fig. 3; 4 inputs (in1, in2, in3, in4) connected to the centralized register bank; 4 additional inputs (A, B, C, D) connected to the register bank or to another CGC; two outputs (out1, out2) also connected to the register bank and/or to another CGC; and two outputs (out3, out4) whose values are stored in the register bank. Since each internal node performs two-operand operations, multiplexers are used to select the inputs of the nodes of the second row. A detailed structure of the 2×2 CGC is shown in Fig. 4.
 
 
Fig. 3: General architecture of a 2x2 CGC
Each node consists of two computational units, a multiplier and an ALU, as shown in Fig. 5. Both units are implemented in combinational logic to benefit from the chaining of operations inside the CGC. The flexible interconnection among the nodes is chosen so that the control-unit can easily realize any desired hardware template by properly configuring the existing steering logic (i.e. the multiplexers).
Fig. 4: Detailed architecture of a 2x2 CGC
The ALU performs shift, arithmetic, and logical operations. Each time, either the multiplier or the ALU is activated according to the control signals Sel1 and Sel2, as shown in Fig. 5. When a node is utilized, one of the Sel1 and Sel2 signals is set to 1 so as to enable the desired operation; if the node is not utilized, both Sel1 and Sel2 are set to 0. A bit-width of 16 bits is chosen for each unit because such a word-length is adequate for many DSP applications. Moreover, multiplication and ALU operations are selected because DSP kernels mainly consist of these operations; for example, the multiply-accumulate (MAC) operation is the most common operation in this application domain. The same operations have also been chosen in the templates of [9], [10], [12]. A task mapped to the reconfigurable CGC-based data-path should not include exceptional operations such as division or square root. Such operations must be decomposed into sequences of multiplication and ALU operations, or otherwise they are executed on the fine-grain hardware.
 
Fig. 5: Node structure
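To illustrate the node's behavior, the following is a minimal Python sketch of a single CGC node, assuming 16-bit wrap-around arithmetic and a small, purely illustrative ALU operation set; the names (cgc_node, alu, MASK16) and the MAC usage example are ours, not part of the proposed architecture.

# Minimal behavioral sketch of one CGC node (Fig. 5), assuming 16-bit
# wrap-around arithmetic; names and the ALU opcode set are illustrative.
MASK16 = 0xFFFF  # 16-bit word width, as chosen for the CGC units

def alu(op, a, b):
    """Illustrative ALU: shift, arithmetic and logical operations."""
    ops = {
        "add": lambda: a + b,
        "sub": lambda: a - b,
        "and": lambda: a & b,
        "or":  lambda: a | b,
        "xor": lambda: a ^ b,
        "shl": lambda: a << (b & 0xF),
        "shr": lambda: a >> (b & 0xF),
    }
    return ops[op]() & MASK16

def cgc_node(in_a, in_b, sel1, sel2, alu_op="add"):
    """One node: a combinational multiplier and an ALU.
    Exactly one of sel1/sel2 is set when the node is used;
    both 0 means the node is unused (modelled here as None)."""
    if sel1 and not sel2:          # multiplier selected
        return (in_a * in_b) & MASK16
    if sel2 and not sel1:          # ALU selected
        return alu(alu_op, in_a, in_b)
    return None                    # node not utilized

# Usage: a multiply-accumulate (MAC) chained over two nodes in one cycle,
# e.g. a first-row node multiplies and a second-row node adds the product
# to a running sum, exploiting the chaining of operations inside the CGC.
prod = cgc_node(3, 5, sel1=1, sel2=0)                      # 3 * 5
acc = cgc_node(prod, 100, sel1=0, sel2=1, alu_op="add")    # 15 + 100
assert acc == 115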
An n×m CGC has an analogous structure. In particular, the first-row nodes obtain their inputs from the register bank. All the other nodes obtain their inputs from the register bank and/or from a row with a smaller index of the same and/or another CGC. Regarding the outputs, the last-row nodes store the results of their operations in the register bank. All the other nodes pass their results to the register bank and/or to another CGC of the data-path. Regarding the direct interconnections among the nodes, only connections from a row with a smaller index to a row with a greater index are permitted. The coordinate of the upper-left node is (0, 0).
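The row-ordering rule above can be summarized by the short sketch below, which enumerates the direct node-to-node connections a data-path of p such CGCs would permit; the helper name permitted_connections and the (cgc, row, col) node encoding are our own conventions, used only for illustration.

# Sketch of the permitted direct connections in a data-path of p CGCs,
# each an array with `rows` rows and `cols` nodes per row (names are ours).
from itertools import product

def permitted_connections(rows, cols, p):
    """Yield (src, dst) node pairs; a direct connection is allowed only
    from a row with a smaller index to a row with a greater index
    (upper-left node is (0, 0)), whether or not src and dst belong to
    the same CGC."""
    nodes = list(product(range(p), range(rows), range(cols)))  # (cgc, row, col)
    for src, dst in product(nodes, nodes):
        if src[1] < dst[1]:   # source row index < destination row index
            yield src, dst

# For two 2x2 CGCs, every first-row node may feed every second-row node.
conns = list(permitted_connections(rows=2, cols=2, p=2))
assert len(conns) == 4 * 4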
3.3. Reconfigurable interconnection network
The reconfigurable interconnection network is divided into two sub-networks. The first is used for the communication among the CGCs. The second is used for the communication of the CGCs with the register bank for storing and fetching data values. The direct connections among the CGCs of the data-path are implemented by a crossbar interconnection network, as illustrated in Fig. 6. This network is chosen so as to enable all possible inter-CGC connections, thus achieving full connectivity. If there are N inputs and M outputs, there are N×M switches, N buses, and M loads per connection.
Fig. 6: Crossbar interconnection network
In the reconfigurable coarse-grain data-path, full connectivity is provided between the CGCs and the centralized register bank using a crossbar interconnection network. All the CGCs' outputs can write to any register in the bank and all the CGCs' inputs can read data from any register in the bank. This eliminates the register allocation problem found in High-Level Synthesis (HLS) systems, and with it the complex graph coloring algorithms used to solve that problem.
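As a rough illustration of this crossbar cost model, the sketch below counts switches, buses, and loads for given numbers of inputs and outputs, and instantiates it for the inter-CGC network using the intermediate input/output counts derived later in Section 4.1; the function names are our own and the counts are only as accurate as that analysis.

# Rough cost model of a full crossbar: N inputs and M outputs give
# N*M switches, N buses, and M loads per connection (names are ours).
def crossbar_resources(n_inputs, m_outputs):
    return {"switches": n_inputs * m_outputs,
            "buses": n_inputs,
            "loads_per_connection": m_outputs}

def inter_cgc_crossbar(n, m, p):
    """Inter-CGC crossbar of a data-path with p CGCs of n x m nodes,
    following the counts of Section 4.1: 2(n(m-1))p buses and
    (n(m-1))p loads."""
    buses = 2 * n * (m - 1) * p     # intermediate inputs of the CGCs
    loads = n * (m - 1) * p         # intermediate outputs of the CGCs
    return crossbar_resources(buses, loads)

# Two 2x2 CGCs: 8 buses, 4 loads per connection, 32 switches.
print(inter_cgc_crossbar(n=2, m=2, p=2))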
4. COARSE-GRAIN RECONFIGURABLE DATA-PATH ANALYSIS
Section 4.1 shows that the clock period overhead due to the flexible data-path interconnections, as well as the intra-CGC interconnections, is not significant. Section 4.2 justifies our decision to implement the control on the fine-grain hardware while insisting on the flexibility of the data-path interconnects and the universality of the used coarse-grain component type. Section 4.3 describes area and power considerations in the proposed architecture. Finally, Section 4.4 demonstrates that the proposed architecture provides better support for temporal partitioning compared to the architectural decisions in [9], [10], [12].
4.1. Delay analysis
The delay of the coarse-grain data-path, D_data_path, is the delay of the used processing elements (nodes) plus the delay of the interconnections. If the data-path consists of a number of n×m identical CGCs, and considering that each node can be connected to nodes (of the same or another CGC) of a row with a greater index, the maximum number of directly connected nodes is n. Since n nodes can also be connected inside a CGC, the delay of the used processing elements is the delay of each CGC. Therefore, the delay of the coarse-grain data-path, D_data_path, is the intra-CGC delay, D_intra_CGC, plus the delay of the reconfigurable interconnection network, D_netw. The latter is further divided into two components: the delay of the crossbar network connecting the CGCs, D_inter_CGC, and the register transfer delay, D_reg_trans.

D_data_path = D_intra_CGC + D_netw = D_intra_CGC + D_inter_CGC + D_reg_trans    (1)
 
For the case where the data-path consists of a set of n×m identical CGCs, the intra-CGC delay, D_intra_CGC, is given by the following formula:

D_intra_CGC = n·D_mult + Σ_{l=1}^{n−1} D_MUX,(m+2):1 + n·D_BUF    (2)
 
where D_mult is the delay of the multiplier, D_MUX,(m+2):1 is the delay of one (m+2)-to-1 multiplexer, l is the CGC's row index, and D_BUF is the delay of the tri-state buffer. Compared with templates consisting of n directly connected primitive resources, the extra delay overhead, D_intra_CGC_over, is the delay of the steering logic and the buffers. It is given by the following expression:

D_intra_CGC_over = Σ_{l=1}^{n−1} D_MUX,(m+2):1 + n·D_BUF    (3)

If a data-path consists of a number of 2×2 CGCs (n = m = 2), the intra-CGC delay overhead is D_intra_CGC_over = D_MUX,4:1 + 2·D_BUF, which is a small fraction of the operations' delay (i.e. the delay of the multiplier). This extra delay is affordable, since it enables the simple realization of any required template and the optimal exploitation of the inter-CGC chaining that the CGC offers.

If a data-path consists of p n×m CGCs, there are 2(n(m−1))p intermediate inputs (all the inputs except the first-row ones) and (n(m−1))p intermediate outputs (all the outputs except the last-row ones). So, for the crossbar network connecting the CGCs there are 2(n(m−1))p buses and (n(m−1))p loads. Considering the structure of the interconnection network, the switches' delay is determined by the switches in the path from the top-left corner input to the bottom-right corner output. For p n×m CGCs the number of switches in this path is:

#switches = #buses + #loads − 1 = 3(n(m−1))p − 1    (4)

The drivers are placed at the intermediate outputs of a CGC and drive the buses to the intermediate inputs of the other CGCs. The driver's delay, D_driver, is proportional to the number of inputs that it drives and to the capacitance of these inputs. Therefore, the inter-CGC delay, D_inter_CGC, is:

D_inter_CGC = [3(n(m−1))p − 1]·D_switch + (n(m−1))p·D_driver    (5)

where D_switch and D_driver are the delays of a switch and a driver, respectively. For a data-path consisting of two 2×2 CGCs (n = m = p = 2), the inter-CGC delay is D_inter_CGC = 11·D_switch + 4·D_driver.
This delay is also affordable, since it allows the whole DFG to be covered by the CGCs. The average ILP of a DFG is small, and thus a small number of CGCs is needed. Since we use either 2×2 or 2×3 CGCs and the crossbar interconnect delay is small, our architectural decisions do not significantly impact the clock period.

The register transfer delay, D_reg_trans, is the sum of the data transfer delay from the register bank to the CGCs, D_reg_to_comp, and the data transfer delay from the CGCs to the register bank, D_comp_to_reg. If the register bank size is fixed and there are p n×m CGCs, then the crossbar network that connects the register bank outputs to the CGCs' inputs has 2nmp buses (equal to the number of nodes' inputs that require data from the register bank).
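To make the delay model concrete, the following sketch evaluates Eqs. (2), (3), and (5) for a data-path of two 2×2 CGCs. The unit delays are placeholder values chosen only for illustration, not measured figures, and the register-transfer term of Eq. (1) is not expanded here.

# Sketch evaluating the delay model of Section 4.1 (Eqs. 2, 3 and 5) for a
# data-path of p n x m CGCs. The unit delays used below are placeholders.
def intra_cgc_delay(n, d_mult, d_mux, d_buf):
    """Eq. (2): n multiplier delays along the chained path, one (m+2):1
    MUX delay (d_mux) for each of the n-1 non-first rows, and n tri-state
    buffer delays."""
    return n * d_mult + (n - 1) * d_mux + n * d_buf

def intra_cgc_overhead(n, d_mux, d_buf):
    """Eq. (3): the steering-logic and buffer delay added on top of n
    chained primitive resources."""
    return (n - 1) * d_mux + n * d_buf

def inter_cgc_delay(n, m, p, d_switch, d_driver):
    """Eq. (5): switches on the longest crossbar path (Eq. 4) plus the
    drivers placed at the (n(m-1))p intermediate outputs."""
    switches = 3 * n * (m - 1) * p - 1
    drivers = n * (m - 1) * p
    return switches * d_switch + drivers * d_driver

# Example: two 2x2 CGCs (n = m = p = 2) with placeholder unit delays in ns.
print(intra_cgc_delay(n=2, d_mult=4.0, d_mux=0.3, d_buf=0.2))        # 8.7
print(intra_cgc_overhead(n=2, d_mux=0.3, d_buf=0.2))                 # 0.7
print(inter_cgc_delay(n=2, m=2, p=2, d_switch=0.1, d_driver=0.25))   # 2.1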