A Customized Cross-Bar for Data-Shuffling in Domain-Specific SIMD Processors
Praveen Raghavan¹,², Satyakiran Munaga¹,², Estela Rey Ramos¹,³, Andy Lambrechts¹,², Murali Jayapala¹, Francky Catthoor¹,², and Diederik Verkest¹,²,⁴

¹ IMEC vzw, Kapeldreef 75, Heverlee, Belgium 3001
{ragha, satyaki, reyramos, lambreca, jayapala, catthoor, verkest}@imec.be
² ESAT, Kasteelpark Arenberg 10, K. U. Leuven, Heverlee, Belgium 3001
³ Electrical Engineering, Universidade de Vigo, Spain
⁴ Electrical Engineering, Vrije Universiteit Brussels, Belgium
Abstract. Shuffle operations are one of the most common operations in SIMD-based embedded system architectures. In this paper we study different families of shuffle operations that frequently occur in embedded applications running on SIMD architectures. These shuffle operations are used to drive the design of a custom shuffler for domain-specific SIMD processors. The energy efficiency of various crossbar-based custom shufflers is analyzed and compared with the widely used full crossbar. We show that by customizing the crossbar to implement specific shuffle operations required in the target application domain, we can reduce the energy consumption of shuffle operations by up to 80%. We also illustrate the tradeoffs between flexibility and energy efficiency of custom shufflers and show that customization offers reasonable benefits without compromising the flexibility required for the target application domain.
1 Introduction
Due to growing computational demands and strict low-cost requirements in embedded systems, there has been a trend to move toward processors that can deliver a high throughput (MIPS) at a high energy efficiency (MIPS/mW). Application-domain specific processors offer a good tradeoff between energy efficiency and the flexibility required in embedded system implementations. One of the most effective ways to improve energy efficiency in data-dominated application domains, such as multimedia and wireless, is to exploit the data-level parallelism available in these applications [1,2]. SIMD exploits data-level parallelism at the operation or instruction level. Prime illustrations of processors using SIMD are [3,4,5], Altivec [6], SSE2 [7] etc.

When embedded applications like SDR (software defined radio), MPEG2 etc. are mapped on these SIMD architectures, one of the bottlenecks, both in terms of power and performance, is the shuffle operations. When an application like GSM decoding using Viterbi is mapped on Altivec-based processors, 30% of all instructions are shuffles [8]. The functional unit which performs these shuffle operations, known as the shuffler or permutation unit, is usually implemented as a full crossbar, which requires a large amount of interconnect. It has been shown in [9] that interconnect will be one of the most dominant parts of the delay and energy consumption in future technologies¹. Hence it is important to minimize the interconnect requirement of shufflers to improve the energy efficiency of future SIMD-based architectures.

Implementing a shuffler as a full crossbar offers extreme flexibility (in terms of the variety of shuffle operations that can be performed), but such flexibility is often not needed for the applications at hand. Only a few specific sequences of shuffle operations occur in embedded systems, and the knowledge of these patterns can be exploited to customize the shuffler and thus to improve its energy efficiency. To the best of our knowledge, there is no prior art that explores different shuffle operations in embedded systems and exploits these patterns to design energy-efficient shufflers.

In this paper, we first study different families of shuffle operations or patterns that occur most frequently in embedded application domains, such as wireless and multimedia, and later use them to customize a crossbar-based shuffler. Customization exploits the fact that the shuffle operations of the target application domains do not require all inputs to be routed to all outputs, which is the case in a full crossbar, and thus reduces both the logic and the interconnect complexity.

This paper is organized as follows: Section 2 gives a brief overview of related work on shuffle networks in both the networking and SIMD processor domains. Section 3 describes different shuffle operations that occur in embedded systems. Section 4 shows how a crossbar can be customized for the required shuffle operations and to what extent such customization can help. Section 5 presents experimental results of custom shufflers for different datapath and subword sizes. Finally, we conclude the paper in Section 6.

P. Lukowicz, L. Thiele, and G. Tröster (Eds.): ARCS 2007, LNCS 4415, pp. 57–68, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Related Work
A large body of work exists on different shuffle networks in the domain of networking switches and Networks-on-Chip [10]. These networks consist of different switches like Crossbar, Benes, Banyan, Omega, Cube etc. These switches usually have only a few cross-points, as the flexibility that is needed for NoC switches is quite low. When a large amount of flexibility is needed, a crossbar-based switch is used. Research like [11,12,13,14,15,16] illustrates the exploration space of different switches for these networks. In the case of network switches, the path of the packet from input to output is arbitrary, as communication can exist between any processing elements. Therefore the knowledge of the application domain cannot be exploited to customize it further. In the case of networks, other metrics like bandwidth and latency are also important and hence the optimizations are different.

Other related work exists in the area of data shuffle networks for ASICs. Work like [17,18] and [19] customizes different networks for performing specific applications, like FFT butterflies, cryptographic algorithms etc. [20] customizes the shuffle network for linear convolution. These networks are too specific to be used in a programmable processor and none of this work has focused on power or energy consumption. To the best of our knowledge, there is no work which explores the energy efficiency of shuffle networks for SIMD embedded systems. The crossbar is picked over other shuffle networks (like Benes, Banyan etc.) as it can perform all kinds of shuffle operations. Also, the data routing from inputs to outputs is straightforward, which eases the control word (or MUX selection) generation, design verification, and design upgrades².

¹ In our experiments using 130nm technology, we observe that roughly 80% of the crossbar dynamic power consumption is due to inter-cell interconnect.
Table 1. Different Shuffle Families for a 64-bit Datapath and 8-bit Subword Configuration that occur in Embedded Systems. ';' denotes the end of one shuffle operation. '|' denotes the end of one output in case of a two-output family. '-' denotes a don't care.

Family Name             Occurs in Domain         Description                               Shuffle Operations
64 m8 O1 F FFT          Wireless                 FFT butterflies                           a0 b0 a2 b2 a4 b4 a6 b6 ; a1 b1 a3 b3 a5 b5 a7 b7
                                                                                           a0 a1 b0 b1 a4 a5 b4 b5 ; a2 a3 b2 b3 a6 a7 b6 b7
                                                                                           a0 a1 b0 b1 a2 a3 b2 b3 ; a4 a5 b4 b5 a6 a7 b6 b7
                                                                                           a0 a1 a2 a3 b0 b1 b2 b3 ; a4 a5 a6 a7 b4 b5 b6 b7
64 m8 O1 F GSM          Wireless                 GSM decode (Viterbi)                      a0 a2 a4 a6 b0 b2 b4 b6 ; a1 a3 a5 a7 b1 b3 b5 b7
                                                                                           a0 a1 a0 a1 b0 b1 b0 b1 ; a1 a0 a1 a0 b1 b0 b1 b0
                                                                                           a2 a3 a2 a3 b2 b3 b2 b3 ; a3 a2 a3 a2 b3 b2 b3 b2
                                                                                           a4 a5 a4 a5 b4 b5 b4 b5 ; a5 a4 a5 a4 b5 b4 b5 b4
                                                                                           a6 a7 a6 a7 b6 b7 b6 b7 ; a7 a6 a7 a6 b7 b6 b7 b6
64 m8 O1 F Broadcast    Multimedia               Broadcast for masking                     a0 a0 a0 a0 a0 a0 a0 a0 ; a1 a1 a1 a1 a1 a1 a1 a1
                                                                                           a2 a2 a2 a2 a2 a2 a2 a2 ; a3 a3 a3 a3 a3 a3 a3 a3
                                                                                           a4 a4 a4 a4 a4 a4 a4 a4 ; a5 a5 a5 a5 a5 a5 a5 a5
                                                                                           a6 a6 a6 a6 a6 a6 a6 a6 ; a7 a7 a7 a7 a7 a7 a7 a7
64 m8 O1 F DCT          Multimedia               DCT                                       a0 b0 a1 b1 a2 b2 a3 b3 ; a4 b4 a5 b5 a6 b6 a7 b7
                                                                                           a0 a1 b0 b1 a2 a3 b2 b3 ; a4 a5 b4 b5 a6 a7 b6 b7
64 m8 O1 F Interleave   Multimedia and Wireless  Interleaving two inputs                   a0 b0 a1 b1 a2 b2 a3 b3 ; a1 b1 a2 b2 a3 b3 a4 b4
                                                                                           a2 b2 a3 b3 a4 b4 a5 b5 ; a3 b3 a4 b4 a5 b5 a6 b6
                                                                                           a4 b4 a5 b5 a6 b6 a7 b7 ;
64 m8 O1 F Filter       Multimedia and Wireless  Filtering, correlators, cross-correlator  a1 a2 a3 a4 a5 a6 a7 b0 ; a2 a3 a4 a5 a6 a7 b0 b1
                                                                                           a3 a4 a5 a6 a7 b0 b1 b2 ; a4 a5 a6 a7 b0 b1 b2 b3
                                                                                           a5 a6 a7 b0 b1 b2 b3 b4 ; a6 a7 b0 b1 b2 b3 b4 b5
                                                                                           a7 b0 b1 b2 b3 b4 b5 b6 ;
64 m8 O2 F FFT          Wireless                 Two adjacent FFT butterflies              a0 b0 a2 b2 a4 b4 a6 b6 | a1 b1 a3 b3 a5 b5 a7 b7
                                                                                           a0 a1 b0 b1 a4 a5 b4 b5 | a2 a3 b2 b3 a6 a7 b6 b7
                                                                                           a0 a1 b0 b1 a2 a3 b2 b3 | a4 a5 b4 b5 a6 a7 b6 b7
                                                                                           a0 a1 a2 a3 b0 b1 b2 b3 | a4 a5 a6 a7 b4 b5 b6 b7
² The instructions and their encoding remain the same, even when the shuffler specification (in terms of the set of specific shuffle operations to be implemented) changes during the design process, as long as the encoding of the MUX selection lines remains unchanged in the customization.

3 Shuffle Families

A shuffle operation takes two input words and produces one or two outputs with the required composition of input subwords, which is specified by the control or selection lines. The choice of two outputs has both advantages and disadvantages for the processor architecture. Using a two-output shuffle unit implies that a lower number of instructions is required for performing the shuffles required for an application, but at the cost of increased control overhead. A two-output shuffle would also require that the shuffler uses two ports of the register file to write back the results. In this paper we present both single-output and two-output shufflers, but further details on the implications of using a one- or two-output shuffler unit on the full system are beyond the scope of this paper. The required shuffle operations vary across application kernels, subword sizes, and datapath sizes. To illustrate the different shuffle operations, we first introduce a set of definitions:

– Shuffle Operation: For a given set of subword-organized inputs, a particular output subword organization.
– Family: A set of closely related shuffle operations that are used in an application kernel for given subword and datapath sizes.
– Datapath/Word Size: The total number of bits the datapath operates on at a given time.
– Subword Size: The size of an atomic data element, e.g. 8-bit and 16-bit.

The different families use the following naming convention: (Datapath Size) m(Subword Size) O(# of Outputs) F Type. For example, 128 m8 O2 F FFT is a collection of shuffle operations required by an "FFT" kernel operating on 8-bit data elements and implemented on a 128-bit datapath.
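As an illustration of the naming convention, the following Python sketch splits a family name into its fields. The class `ShuffleFamily`, the function `parse_family`, and the choice to accept either spaces or underscores between the fields are assumptions made for this example; they are not part of the design flow described in this paper.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffleFamily:
    """Fields encoded in a family name (hypothetical helper type)."""
    datapath_bits: int   # total bits the datapath operates on
    subword_bits: int    # size of one atomic data element
    outputs: int         # 1 (O1) or 2 (O2) output words
    kind: str            # kernel type, e.g. "FFT" or "Filter"

    @property
    def subwords_per_word(self) -> int:
        return self.datapath_bits // self.subword_bits


# Assumption: fields are separated by a single space or underscore.
_NAME = re.compile(r"^(\d+)[ _]m(\d+)[ _]O([12])[ _]F[ _](\w+)$")


def parse_family(name: str) -> ShuffleFamily:
    """Parse e.g. '128 m8 O2 F FFT' into its four fields."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a valid family name: {name!r}")
    datapath, subword, outputs, kind = match.groups()
    if int(datapath) % int(subword) != 0:
        raise ValueError("datapath size must be a multiple of the subword size")
    return ShuffleFamily(int(datapath), int(subword), int(outputs), kind)


if __name__ == "__main__":
    family = parse_family("128 m8 O2 F FFT")
    print(family, family.subwords_per_word)
```

For the example name above, the sketch yields a 128-bit datapath, 8-bit subwords, two outputs, and therefore 16 subwords per word.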
3.1 Families of Shuffle Operations

1. FFT: The FFT family includes all the butterfly shuffle operations that are needed for performing an FFT.
2. Interleave: The Interleave family includes the shuffle operations required for interleaving the two input words in different ways.
3. Filter: The Filter family includes the shuffle operations required to perform various filter operations, correlators and cross-correlators.
4. Broadcast: The Broadcast family includes the shuffle operations required for broadcasting a single subword into all the subword locations.
5. GSM: The GSM family includes the shuffle operations required for the different operations during the Viterbi-based GSM decoding.
6. DCT: The DCT family includes the shuffle operations required for performing a two-dimensional DCT operation.

Table 1 shows the shuffle operations required by the aforementioned application kernels operating on 8-bit subwords and implemented on a 64-bit datapath. The table also indicates the domain in which these shuffle operations occur. It is assumed that the two inputs to the shuffler are the two words a0 a1 a2 a3 a4 a5 a6 a7 and b0 b1 b2 b3 b4 b5 b6 b7 respectively, where each of a0 to b7 is a subword of size 8-bit. The operations that correspond to other datapath sizes and subword modes can be derived similarly. The two-output (O2) shuffle operations are similar to the one-output (O1) shuffle operations, except that they simultaneously perform two consecutive permutations that are needed by the algorithm. For example, in the case of the FFT, two butterflies that are needed in the same stage are done together. As the shuffle operation for a two-output operation can be obtained by concatenating two adjacent shuffle operations of the one-output operation, only one example is shown in the table.
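The shuffle operations of Table 1 can be modelled behaviourally in a few lines of Python. The sketch below is only an illustration: the helper names (`parse_pattern`, `shuffle`, `shuffle_two_outputs`, `filter_family`) are hypothetical, subwords are plain list elements, and the Filter family is generated under the assumption that its operations form a sliding window over the concatenation of the two input words, which is how the corresponding rows of Table 1 read.

```python
def parse_pattern(text):
    """Turn 'a0 b0 a2 b2' into [('a', 0), ('b', 0), ('a', 2), ('b', 2)]."""
    return [(token[0], int(token[1:])) for token in text.split()]


def shuffle(word_a, word_b, pattern):
    """Apply one shuffle operation to two input words (lists of subwords)."""
    if len(word_a) != len(word_b) or len(pattern) != len(word_a):
        raise ValueError("inputs and pattern must have the same subword count")
    source = {"a": word_a, "b": word_b}
    return [source[name][index] for name, index in pattern]


def shuffle_two_outputs(word_a, word_b, first, second):
    """Model an O2 operation as two adjacent O1 operations on the same inputs."""
    return shuffle(word_a, word_b, first), shuffle(word_a, word_b, second)


def filter_family(subwords):
    """Sliding windows over a0..a(n-1) b0..b(n-1), starting at offsets 1..n-1.

    Assumption: this regular structure is inferred from the Filter rows
    of Table 1; it is not stated explicitly in the text.
    """
    names = [("a", i) for i in range(subwords)] + [("b", i) for i in range(subwords)]
    return [names[start:start + subwords] for start in range(1, subwords)]


if __name__ == "__main__":
    a = list(range(10, 18))   # stands for a0..a7
    b = list(range(20, 28))   # stands for b0..b7
    first = parse_pattern("a0 b0 a2 b2 a4 b4 a6 b6")
    second = parse_pattern("a1 b1 a3 b3 a5 b5 a7 b7")
    print(shuffle_two_outputs(a, b, first, second))
    print(len(filter_family(8)))
```

With the two FFT patterns shown, the two-output call returns the same words that two successive one-output shuffles would produce, which mirrors the concatenation argument made above.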
4 Crossbar Customization
Figure 1 shows a typical full-crossbar implementation, where all the inputs are connected to all the outputs. Used in a 32-bit datapath, it can perform all varieties of one-output shuffle operations with both 8-bit and 16-bit subwords. The hardware required to implement this is four 8-bit 8:1 multiplexers (MUXes) and the interconnections from the different subword inputs to the MUXes. It is clear that this is extremely flexible, but requires a large amount of interconnect. Therefore the power consumption of this full-crossbar implementation is extremely high³.
[Figure: Input Word 1 (a0 a1 a2 a3) and Input Word 2 (b0 b1 b2 b3), each 32-bit wide with 8-bit subwords, connected to a 32-bit Output Word]
Fig. 1. Full Crossbar with two inputs and one output
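Functionally, the full crossbar of Figure 1 can be viewed as one multiplexer per output subword, each selecting among all input subwords. A minimal Python sketch of this view follows. The select encoding (index 0 to 3 for a0 to a3, index 4 to 7 for b0 to b3) and the function names are assumptions for illustration, not the encoding used in the actual design.

```python
def full_crossbar_size(datapath_bits, subword_bits):
    """Return (number of output MUXes, inputs per MUX) for a full crossbar
    with two input words and one output word."""
    outputs = datapath_bits // subword_bits
    inputs_per_mux = 2 * outputs      # every subword of both input words
    return outputs, inputs_per_mux


def full_crossbar(word_a, word_b, selects):
    """Behavioural model: output i is the input chosen by selects[i].

    Assumed encoding: selects index into the concatenation a0..a(n-1), b0..b(n-1).
    """
    inputs = list(word_a) + list(word_b)
    if len(selects) != len(word_a):
        raise ValueError("one select value is needed per output subword")
    for select in selects:
        if not 0 <= select < len(inputs):
            raise ValueError(f"select {select} is out of range")
    return [inputs[select] for select in selects]


if __name__ == "__main__":
    print(full_crossbar_size(32, 8))
    print(full_crossbar(["a0", "a1", "a2", "a3"],
                        ["b0", "b1", "b2", "b3"],
                        [0, 4, 2, 6]))
```

For a 32-bit datapath with 8-bit subwords the size helper gives four MUXes with eight inputs each, matching the four 8:1 multiplexers mentioned above.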
Now consider a shuffler that needs to implement just those shuffle operations represented by the family 32 m8 O1 F FFT, which are shown in Table 2. From the table it is evident that in such a design not all inputs are required to be routed to each of the outputs. E.g., the first subword output MUX requires inputs a0, a1, and a2 only. Figure 2 shows the customized crossbar which can implement the shuffle operations of Table 2. Thus, given a set of shuffle operations/families that is required, the corresponding customized crossbar can be instantiated by removing the unused input connections to each of the output MUXes. This reduces both the MUX and the interconnect complexity. We still retain the encoding of the MUX selection signals of the full crossbar for design simplicity reasons. It should be noted that further energy savings can be achieved by choosing an optimal encoding for the selection lines (potentially a different encoding across MUXes), but this is not explored in this work.
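The customization step can be sketched as a small script that collects, for every output position, the set of input subwords that any required shuffle operation routes there. This is an illustrative model only: the function names are hypothetical, '-' is assumed to mark a don't care, and the count of subword-level MUX inputs is a rough proxy for interconnect, not a substitute for the energy figures reported in this paper. The one-output FFT family of Table 1 is used as input.

```python
# One-output FFT family for a 64-bit datapath with 8-bit subwords (Table 1).
FFT_64_M8_O1 = [
    "a0 b0 a2 b2 a4 b4 a6 b6", "a1 b1 a3 b3 a5 b5 a7 b7",
    "a0 a1 b0 b1 a4 a5 b4 b5", "a2 a3 b2 b3 a6 a7 b6 b7",
    "a0 a1 b0 b1 a2 a3 b2 b3", "a4 a5 b4 b5 a6 a7 b6 b7",
    "a0 a1 a2 a3 b0 b1 b2 b3", "a4 a5 a6 a7 b4 b5 b6 b7",
]


def required_inputs(operations):
    """For each output subword, collect the inputs that must reach its MUX."""
    width = len(operations[0].split())
    needed = [set() for _ in range(width)]
    for operation in operations:
        tokens = operation.split()
        if len(tokens) != width:
            raise ValueError("all operations must have the same subword count")
        for position, token in enumerate(tokens):
            if token != "-":          # assumed don't-care marker
                needed[position].add(token)
    return needed


def connection_counts(operations):
    """Return (customized, full) numbers of subword-level MUX inputs."""
    needed = required_inputs(operations)
    width = len(needed)
    customized = sum(len(inputs) for inputs in needed)
    full = width * 2 * width          # every output sees both input words
    return customized, full


if __name__ == "__main__":
    print(sorted(required_inputs(FFT_64_M8_O1)[0]))   # ['a0', 'a1', 'a2', 'a4']
    print(connection_counts(FFT_64_M8_O1))            # (42, 128)
```

In this model the first output MUX of the 64-bit FFT family needs only four of the sixteen input subwords, and the whole family needs 42 subword connections where a full crossbar provides 128. Unused connections are simply dropped, while the select values of the remaining inputs can stay as they were in the full crossbar.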
³ In our experiments we observed that a shuffle operation on this implementation consumes nearly the same amount of dynamic energy as that of a 32-bit add operation.