A Customized Cross-Bar for Data-Shuffling in Domain-Specific SIMD Processors

Praveen Raghavan (1,2), Satyakiran Munaga (1,2), Estela Rey Ramos (1,3), Andy Lambrechts (1,2), Murali Jayapala (1), Francky Catthoor (1,2), and Diederik Verkest (1,2,4)

(1) IMEC vzw, Kapeldreef 75, Heverlee, Belgium - 3001
{ragha, satyaki, reyramos, lambreca, jayapala, catthoor, verkest}@imec.be
(2) ESAT, Kasteelpark Arenberg 10, K. U. Leuven, Heverlee, Belgium - 3001
(3) Electrical Engineering, Universidade de Vigo, Spain
(4) Electrical Engineering, Vrije Universiteit Brussels, Belgium

Abstract. Shuffle operations are among the most common operations in SIMD-based embedded system architectures. In this paper we study different families of shuffle operations that frequently occur in embedded applications running on SIMD architectures. These shuffle operations are used to drive the design of a custom shuffler for domain-specific SIMD processors. The energy efficiency of various crossbar-based custom shufflers is analyzed and compared with the widely used full crossbar. We show that by customizing the crossbar to implement the specific shuffle operations required in the target application domain, we can reduce the energy consumption of shuffle operations by up to 80%. We also illustrate the tradeoffs between flexibility and energy efficiency of custom shufflers and show that customization offers reasonable benefits without compromising the flexibility required for the target application domain.

1 Introduction

Due to growing computational requirements and strict low-cost requirements in embedded systems, there has been a trend to move toward processors that can deliver a high throughput (MIPS) at a high energy efficiency (MIPS/mW). Application-domain specific processors offer a good trade-off between energy efficiency and the flexibility required in embedded system implementations.
One of the most effective ways to improve energy efficiency in data-dominated application domains such as multimedia and wireless is to exploit the data-level parallelism available in these applications [1,2]. SIMD exploits data-level parallelism at the operation or instruction level. Prime illustrations of processors using SIMD are [3,4,5], Altivec [6], SSE2 [7] etc.

When embedded applications like SDR (software defined radio), MPEG2 etc. are mapped on these SIMD architectures, one of the bottlenecks, both in terms of power and performance, is the shuffle operations. When an application like GSM decoding using Viterbi is mapped on Altivec-based processors, 30% of all instructions are shuffles [8]. The functional unit that performs these shuffle operations, known as a shuffler or permutation unit, is usually implemented as a full crossbar, which requires a large amount of interconnect. It has been shown in [9] that interconnect will be one of the most dominant parts of the delay and energy consumption in future technologies (footnote 1). Hence it is important to minimize the interconnect requirement of shufflers to improve the energy efficiency of future SIMD-based architectures.

P. Lukowicz, L. Thiele, and G. Tröster (Eds.): ARCS 2007, LNCS 4415, pp. 57–68, 2007. © Springer-Verlag Berlin Heidelberg 2007

Implementing a shuffler as a full crossbar offers extreme flexibility (in terms of the varieties of shuffle operations that can be performed), but such flexibility is often not needed for the applications at hand. Only a few specific sequences of shuffle operations occur in embedded systems, and the knowledge of these patterns can be exploited to customize the shuffler and thus to improve its energy efficiency.
To the best of our knowledge, there is no prior art that explores different shuffle operations in embedded systems and exploits these patterns to design energy-efficient shufflers.

In this paper, we first study different families of shuffle operations or patterns that occur most frequently in embedded application domains, such as wireless and multimedia, and later use them to customize a crossbar-based shuffler. Customization exploits the fact that the shuffle operations of the target application domains do not require all inputs to be routed to all outputs, which is the case in a full crossbar, and thus reduces both the logic and the interconnect complexity.

This paper is organized as follows: Section 2 gives a brief overview of related work on shuffle networks in both the networking and SIMD processor domains. Section 3 describes different shuffle operations that occur in embedded systems. Section 4 shows how a crossbar can be customized for the required shuffle operations and to what extent such customization can help. Section 5 presents experimental results of custom shufflers for different datapath and sub-word sizes. Finally, we conclude the paper in Section 6.

2 Related Work

A large body of work exists for different shuffle networks in the domain of networking switches and Network-on-Chips [10]. These networks consist of different switches like Crossbar, Benes, Banyan, Omega, Cube etc. These switches usually have only a few cross-points, as the flexibility that is needed for NoC switches is quite low. When a large amount of flexibility is needed, a crossbar-based switch is used. Research like [11,12,13,14,15,16] illustrates the exploration space of different switches for these networks. In case of network switches, the path of the packet from input to output is arbitrary, as communication can exist between any processing elements. Therefore the knowledge of the application domain cannot be exploited to customize it further.
In networks, other metrics such as bandwidth and latency are also important, and hence the optimizations are different.

Other related work exists in the area of data shuffle networks for ASICs. Works such as [17,18] and [19] customize different networks for specific applications, such as FFT butterflies and cryptographic algorithms, and [20] customizes the shuffle network for linear convolution. These designs are too specific to be used in a programmable processor, and none of them has focused on power or energy consumption. To the best of our knowledge, there is no work which explores the energy efficiency of shuffle networks for SIMD embedded systems. The crossbar is picked over other shuffle networks (like Benes, Banyan etc.) as it can perform all kinds of shuffle operations. Also, the data routing from inputs to outputs is straightforward, which eases the control word (or MUX selection) generation, design verification, and design upgrades (footnote 2).

Footnote 1: In our experiments using 130nm technology, we observe that roughly 80% of the crossbar dynamic power consumption is due to inter-cell interconnect.

Table 1. Different Shuffle Families for a 64-bit Datapath and 8-bit Sub-word Configuration that occur in Embedded Systems. ‘;’ denotes the end of one shuffle operation. ‘|’ denotes the end of one output in case of a two-output family. ‘-’ denotes a don't care.
Family Name              Occurs in Domain          Description
    Shuffle Operations

64_m8_O1_F_FFT           Wireless                  FFT Butterflies
    a0 b0 a2 b2  a4 b4 a6 b6 ; a1 b1 a3 b3  a5 b5 a7 b7
    a0 a1 b0 b1  a4 a5 b4 b5 ; a2 a3 b2 b3  a6 a7 b6 b7
    a0 a1 b0 b1  a2 a3 b2 b3 ; a4 a5 b4 b5  a6 a7 b6 b7
    a0 a1 a2 a3  b0 b1 b2 b3 ; a4 a5 a6 a7  b4 b5 b6 b7

64_m8_O1_F_GSM           Wireless                  GSM Decode (Viterbi)
    a0 a2 a4 a6  b0 b2 b4 b6 ; a1 a3 a5 a7  b1 b3 b5 b7
    a0 a1 a0 a1  b0 b1 b0 b1 ; a1 a0 a1 a0  b1 b0 b1 b0
    a2 a3 a2 a3  b2 b3 b2 b3 ; a3 a2 a3 a2  b3 b2 b3 b2
    a4 a5 a4 a5  b4 b5 b4 b5 ; a5 a4 a5 a4  b5 b4 b5 b4
    a6 a7 a6 a7  b6 b7 b6 b7 ; a7 a6 a7 a6  b7 b6 b7 b6

64_m8_O1_F_Broadcast     Multimedia                Broadcast for masking
    a0 a0 a0 a0  a0 a0 a0 a0 ; a1 a1 a1 a1  a1 a1 a1 a1
    a2 a2 a2 a2  a2 a2 a2 a2 ; a3 a3 a3 a3  a3 a3 a3 a3
    a4 a4 a4 a4  a4 a4 a4 a4 ; a5 a5 a5 a5  a5 a5 a5 a5
    a6 a6 a6 a6  a6 a6 a6 a6 ; a7 a7 a7 a7  a7 a7 a7 a7

64_m8_O1_F_DCT           Multimedia                DCT
    a0 b0 a1 b1  a2 b2 a3 b3 ; a4 b4 a5 b5  a6 b6 a7 b7
    a0 a1 b0 b1  a2 a3 b2 b3 ; a4 a5 b4 b5  a6 a7 b6 b7

64_m8_O1_F_Interleave    Multimedia and Wireless   Interleaving two inputs
    a0 b0 a1 b1  a2 b2 a3 b3 ; a1 b1 a2 b2  a3 b3 a4 b4
    a2 b2 a3 b3  a4 b4 a5 b5 ; a3 b3 a4 b4  a5 b5 a6 b6
    a4 b4 a5 b5  a6 b6 a7 b7 ;

64_m8_O1_F_Filter        Multimedia and Wireless   Filtering, Correlators, Cross-correlator
    a1 a2 a3 a4  a5 a6 a7 b0 ; a2 a3 a4 a5  a6 a7 b0 b1
    a3 a4 a5 a6  a7 b0 b1 b2 ; a4 a5 a6 a7  b0 b1 b2 b3
    a5 a6 a7 b0  b1 b2 b3 b4 ; a6 a7 b0 b1  b2 b3 b4 b5
    a7 b0 b1 b2  b3 b4 b5 b6 ;

64_m8_O2_F_FFT           Wireless                  Two adjacent FFT butterflies
    a0 b0 a2 b2  a4 b4 a6 b6 | a1 b1 a3 b3  a5 b5 a7 b7
    a0 a1 b0 b1  a4 a5 b4 b5 | a2 a3 b2 b3  a6 a7 b6 b7
    a0 a1 b0 b1  a2 a3 b2 b3 | a4 a5 b4 b5  a6 a7 b6 b7
    a0 a1 a2 a3  b0 b1 b2 b3 | a4 a5 a6 a7  b4 b5 b6 b7

Footnote 2: The instructions and their encoding remain the same, even when the shuffler specification (in terms of the set of specific shuffle operations to be implemented) changes during the design process, as long as the encoding of MUX selection lines remains unchanged in the customization.

3 Shuffle Families

A shuffle operation takes two input words and produces one or two outputs with the required composition of input sub-words, which is represented by the control or selection lines. The choice of two outputs has both advantages and disadvantages for the processor architecture. Using a two-output shuffle unit means that fewer instructions are required to perform the shuffles of an application, but at the cost of increased control overhead. A two-output shuffler would also need two ports of the register file to write back its results. In this paper we present both single-output and two-output shufflers. Further details on the implications of using a one- or two-output shuffler unit on the full system are beyond the scope of this paper. The required shuffle operations vary across application kernels, sub-word sizes, and datapath sizes. To illustrate the different shuffle operations, we first introduce a set of definitions:

– Shuffle Operation: For a given set of sub-word organized inputs, a particular output sub-word organization.
– Family: A set of closely related shuffle operations that are used in an application kernel for given sub-word and datapath sizes.
– Datapath/Word size: The total number of bits the datapath operates on at a given time.
– Sub-word Size: The size of an atomic data element, e.g., 8-bit and 16-bit.

The different families use the following naming convention: (Datapath Size)_m(Sub-word Size)_O(# of Outputs)_F_Type. For example, 128_m8_O2_F_FFT is a collection of shuffle operations required by an "FFT" kernel operating on 8-bit data elements and implemented on a 128-bit datapath.

3.1 Families of Shuffle Operations

1. FFT: The FFT family includes all the butterfly shuffle operations that are needed for performing an FFT.
2. Interleave: The Interleave family includes the shuffle operations required for interleaving the two input words in different ways.
3. Filter: The Filter family includes the shuffle operations required to perform various filter operations, correlators and cross-correlators.
4. Broadcast: The Broadcast family includes the shuffle operations required for broadcasting a single sub-word into all the sub-word locations.
5. GSM: The GSM family includes the shuffle operations required for the different operations during Viterbi-based GSM decoding.
6. DCT: The DCT family includes the shuffle operations required for performing a two-dimensional DCT operation.

Table 1 shows the shuffle operations required by the aforementioned application kernels operating on 8-bit sub-words and implemented on a 64-bit datapath. The table also indicates the domain in which these shuffle operations occur. It is assumed that the two inputs to the shuffler are the two words a0 a1 a2 a3 a4 a5 a6 a7 and b0 b1 b2 b3 b4 b5 b6 b7 respectively, where each of a0 to b7 is a sub-word of size 8-bit. The operations that correspond to other datapath sizes and sub-word modes can be derived similarly. The two-output (O2) shuffle operations are similar to the one-output (O1) shuffle operations, except that they simultaneously perform two consecutive permutations that are needed by the algorithm. For example, in the case of the FFT, two butterflies that are needed in the same stage are done together.
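The naming convention and the relationship between O1 and O2 operations can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `parse_family`, `shuffle` and `shuffle_o2` are hypothetical, the underscore separators in family names are an assumption about the original typesetting, and the integer sub-word values are purely illustrative.

```python
import re

def parse_family(name):
    """Split a family name such as '128_m8_O2_F_FFT' into its fields.

    Assumption: the fields are underscore-separated; the separators are
    not visible in the extracted text of the paper.
    """
    m = re.fullmatch(r"(\d+)_m(\d+)_O(\d+)_F_(\w+)", name)
    if m is None:
        raise ValueError(f"not a family name: {name!r}")
    datapath, subword, outputs = (int(g) for g in m.groups()[:3])
    return {"datapath_bits": datapath, "subword_bits": subword,
            "outputs": outputs, "kernel": m.group(4),
            "subwords_per_word": datapath // subword}

def shuffle(pattern, a, b):
    """One-output shuffle: pick, per output position, the named input sub-word."""
    pool = {f"a{i}": v for i, v in enumerate(a)}
    pool.update({f"b{i}": v for i, v in enumerate(b)})
    return [pool[name] for name in pattern.split()]

def shuffle_o2(first, second, a, b):
    """Two-output shuffle modelled as two adjacent one-output shuffles."""
    return shuffle(first, a, b), shuffle(second, a, b)

a = list(range(0, 8))     # illustrative values for sub-words a0..a7
b = list(range(10, 18))   # illustrative values for sub-words b0..b7

# First FFT butterfly pair from Table 1, issued as one O2 operation.
even, odd = shuffle_o2("a0 b0 a2 b2 a4 b4 a6 b6",
                       "a1 b1 a3 b3 a5 b5 a7 b7", a, b)
print(parse_family("128_m8_O2_F_FFT"))
print(even)
print(odd)
```

Modelling a shuffle as a list of source names per output position mirrors the notation of Table 1 directly; a hardware description would instead encode each position as a MUX select value.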
Because a two-output shuffle operation can be obtained by concatenating two adjacent one-output shuffle operations, only one two-output example is shown in the table.

4 Crossbar Customization

Figure 1 shows a typical full-crossbar implementation, where all the inputs are connected to all the outputs. Used in a 32-bit datapath, it can perform all varieties of one-output shuffle operations with both 8-bit and 16-bit sub-words. The hardware required to implement this is four 8-bit 8:1 multiplexers (MUXes) and the interconnections from the different sub-word inputs to the MUXes. It is clear that this is extremely flexible, but requires a large amount of interconnect. Therefore the power consumption of this full-crossbar implementation is extremely high (footnote 3).

[Figure: Input Word 1 (a0 to a3) and Input Word 2 (b0 to b3), each made of 8-bit sub-words, feed a 32-bit Output Word.]
Fig. 1. Full Crossbar with two inputs and one output

Suppose a shuffler is needed that implements just the shuffle operations of the family 32_m8_O1_F_FFT, which are shown in Table 2. From the table it is evident that in such a design not all inputs need to be routed to each of the outputs. For example, the MUX of the first output sub-word requires inputs a0, a1, and a2 only. Figure 2 shows the customized crossbar which can implement the shuffle operations of Table 2. Thus, given a set of required shuffle operations or families, the corresponding customized crossbar can be instantiated by removing the unused input connections to each of the output MUXes. This reduces both the MUX and the interconnect complexity. We still retain the encoding of the MUX selection signals of the full crossbar for design simplicity. It should be noted that further energy savings can be achieved by choosing an optimal encoding for the selection lines (potentially a different encoding across MUXes), but this is not explored in this work.
Footnote 3: In our experiments we observed that a shuffle operation on this implementation consumes nearly the same amount of dynamic energy as a 32-bit add operation.
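The customization step of Section 4, removing the input connections that no required shuffle operation uses, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' tool flow: it uses the 64_m8_O1_F_FFT operations of Table 1, the names `mux_inputs` and `select_codes` are hypothetical, and the "cost" is a plain count of MUX input connections, not an energy estimate.

```python
# All 16 sub-word inputs of a 64-bit, 8-bit sub-word, two-input shuffler.
INPUTS = [f"a{i}" for i in range(8)] + [f"b{i}" for i in range(8)]

# 64_m8_O1_F_FFT operations as listed in Table 1.
FFT_FAMILY = [
    "a0 b0 a2 b2 a4 b4 a6 b6", "a1 b1 a3 b3 a5 b5 a7 b7",
    "a0 a1 b0 b1 a4 a5 b4 b5", "a2 a3 b2 b3 a6 a7 b6 b7",
    "a0 a1 b0 b1 a2 a3 b2 b3", "a4 a5 b4 b5 a6 a7 b6 b7",
    "a0 a1 a2 a3 b0 b1 b2 b3", "a4 a5 a6 a7 b4 b5 b6 b7",
]

def mux_inputs(family):
    """Per output position, the inputs that at least one operation routes there."""
    ops = [op.split() for op in family]
    width = len(ops[0])
    return [sorted({op[pos] for op in ops}, key=INPUTS.index)
            for pos in range(width)]

def select_codes(op):
    """MUX select values in the full-crossbar encoding (index into INPUTS).

    The customized crossbar keeps these codes, so the control words do not
    change when connections are removed.
    """
    return [INPUTS.index(name) for name in op.split()]

custom = mux_inputs(FFT_FAMILY)
full_connections = len(INPUTS) * len(custom)        # every input to every output
custom_connections = sum(len(srcs) for srcs in custom)

for pos, srcs in enumerate(custom):
    print(f"output {pos}: {len(srcs)}:1 MUX fed by {srcs}")
print("full crossbar connections:", full_connections)
print("customized connections:  ", custom_connections)
print("select codes of first op:", select_codes(FFT_FAMILY[0]))
```

For these eight operations the count drops from 128 MUX inputs in the full crossbar to 42 in the customized one. Connection count is only a rough proxy for interconnect complexity and should not be read as the energy savings reported in the paper.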