A framework for rapid evaluation of heterogeneous 3-D NoC architectures

Efstathios Sotiriou-Xanthopoulos, Dionysios Diamantopoulos, Kostas Siozios ⇑, George Economakos, Dimitrios Soudris

School of Electrical and Computer Engineering, National Technical University of Athens, 9 Heroon Polytechneiou, Zographou Campus, 15780 Athens, Greece

Article info

Article history: Available online xxxx

Keywords: Heterogeneous NoC; 3-D integration; Exploration framework; CAD tool

Abstract

The scalability of the communication infrastructure in modern Integrated Circuits (ICs) is becoming a challenging issue, which might turn into a significant bottleneck if not carefully addressed. In this direction, the usage of Networks-on-Chip (NoCs) is a preferred solution. In this work, we propose a software-supported framework for quantifying the efficiency of heterogeneous 3-D NoC architectures. In contrast to existing approaches for NoC design, the introduced heterogeneous architecture consists of a mixture of 2-D and 3-D routers, which reduces the delay and power consumption with a slight impact on packet hops. More specifically, the experimental results with a number of DSP applications show the effectiveness of the introduced methodology, as we achieve on average 25% higher maximum operation frequency and 39% lower power consumption compared to uniform 3-D NoCs.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Historically, computation has been expensive and communication cheap. However, with the advance of technology scaling, this has changed. More specifically, in recent years computation has become ever cheaper, while communication encounters fundamental physical limitations such as the time-of-flight of electrical signals, the power spent in driving long wires/cables, etc.
In comparison with off-chip, on-chip communication is significantly cheaper; thus the shift to single-chip systems has relaxed many communication problems. Although the shared bus is a simple interface, since it is built on well-understood concepts and is easy to model, on-chip wires do not scale in the same manner as transistors do, and consequently the cost gap between computation and communication is widening. This problem becomes far more severe in highly interconnected (multi-core) systems. In contrast, Network-on-Chip (NoC) technology is a relatively new approach that enables not only more efficient interconnects but also more effective design and verification processes for modern MPSoCs [5].

Due to the importance of this interconnection paradigm, researchers have spent effort on proposing methodologies and tools for customizing both the router and the entire NoC architecture [6,7,15,20,21]. Moreover, there have recently been attempts to employ advanced process technologies to further improve the performance of NoC platforms. Among others, three-dimensional (3-D) integration, which enables stacking multiple dies along the vertical axis and interconnecting them with very fine-pitch Through-Silicon Vias (TSVs), introduces locality along the z-axis, enabling on average shorter interconnections between system components [2].

The existing approach to designing 3-D NoCs imposes networks consisting solely of 3-D routers. Assuming a 3-D mesh topology, these routers, apart from the direct connection to their four neighbors on the same layer, also provide connectivity to vertically aligned routers (upper and lower layers). Even though such a selection leads to "uniform" underlying hardware, it can rarely be thought of as an efficient solution, since it does not take into account the application's requirements for data transfer.
Specifically, as a NoC is usually an application-oriented communication infrastructure, careful analysis should be performed for deriving an optimum architecture. Towards this goal, throughout this paper we propose the usage of heterogeneous 3-D NoCs, which better tackle the communication constraints posed by the target applications. By the term heterogeneous we refer to architectures that combine a mixture of 2-D and 3-D routers into a single NoC, whereas the spatial assignment of these routers over the target architecture is defined based on the application's requirements.

0141-9331/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.micpro.2013.09.003
⇑ Corresponding author. E-mail address: ksiop@microlab.ntua.gr (K. Siozios).
Microprocessors and Microsystems xxx (2013) xxx–xxx. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/micpro
Please cite this article in press as: E. Sotiriou-Xanthopoulos et al., A framework for rapid evaluation of heterogeneous 3-D NoC architectures, Microprocess. Microsyst. (2013), http://dx.doi.org/10.1016/j.micpro.2013.09.003

Even though there are some prior works on designing heterogeneous 3-D NoCs [2,6,20,21], all of them are based on abstract models, and consequently no useful conclusions about their efficiency can be derived. Specifically, both the NoC's performance and its power/energy dissipation are usually estimated by counting the number of hops (connections between adjacent routers) that a packet has to traverse in order to be delivered from the source to the destination node, ignoring parameters related to the physical implementation. Also, these approaches do not take into consideration constraints posed by the selected 3-D technology, leading to unacceptable architectural solutions (e.g. they impose an excessive amount of TSVs).
On the other hand, the experimentation throughout this work is performed with a framework based on commercial tools (C-to-Silicon and SoC Encounter) provided by Cadence. Based on our analysis with a number of DSP applications, we found that the introduced heterogeneous 3-D NoC outperforms conventional (i.e. uniform) 3-D NoCs, as it achieves on average 25% higher maximum operation frequency and 39% lower power consumption.

The rest of the paper is organized as follows: Section 2 describes the concept of the heterogeneous 3-D NoC architecture. The proposed methodology for designing such an architecture, as well as the tool framework for performing rapid evaluation of these NoCs, are discussed in Sections 3 and 4, respectively. Section 5 presents a number of qualitative and quantitative results that demonstrate the efficiency of the introduced solution. Finally, conclusions are summarized in Section 6.

2. Architecture of the proposed interconnection scheme

This section introduces the architectural organization of the proposed communication scheme, which consists of a mixture of 2-D and 3-D routers. The router's flexibility, denoted as F_s, gives the number of directions to which each incoming wire can be connected. A 2-D router can be used where an incoming routing track is connected only to wires on the same layer (F_s = 3). Alternatively, since a 3-D router also supports connections in the third dimension (upper and lower layers), its flexibility is equal to five (F_s = 5).

In order to depict the differences between these two baseline routers, we assume an application's task graph (depicted in Fig. 1) mapped onto a 3-D chip consisting of five layers. The arrows in this figure denote communication links along either the horizontal or the vertical direction between adjacent routers. Based on this example, not all of these routers exhibit similar requirements for data transfer.
However, since the design of a piece-homogeneous architecture is much easier and more cost-effective than that of an irregular platform, it is possible to cluster routers into two main groups: (i) those that have to support connectivity in the vertical direction (upper and lower layers) and (ii) the rest, which provide packet routing exclusively inside the same layer. Hence, by replacing the routers that belong to the second group with their equivalent 2-D implementations, as depicted in Fig. 2(a), we expect a more efficient hardware implementation.

One more optimization is feasible, which customizes the 2-D and 3-D routers depending on their spatial assignment within each layer. Specifically, the 3-D routers assigned to the bottom (Layer 1) and top (Layer 5) layers need not provide connectivity to lower and higher layers, respectively, as depicted in Fig. 2(b). Similarly, the routers assigned to the periphery of each layer can be designed with fewer ports, as they have a limited number of neighbors. Note that the architectural solution discussed throughout this paper provides the maximum possible customization (similar to Fig. 2(b)) in order to derive an application-specific heterogeneous 3-D NoC architecture.

Based on this analysis, apart from the 2-D router (referred to as Router 0), we employ three more flavors of 3-D routers:

- 3-D Router 1: It supports connectivity from lower to higher layer. Since an incoming packet can be routed in four different directions (note that the input and output ports cannot be the same, in order to avoid packet deadlocks), the router's flexibility is F_s = 4.
- 3-D Router 2: It supports connectivity for links that realize connections from higher to lower layers. Similar to the previous case, this router has F_s = 4.
- 3-D Router 3: It supports connectivity from lower to higher layer and vice versa (F_s = 5).

2.1. Designing 2-D and 3-D routers

The basic component of the proposed architecture is the NoC router. This router was designed in SystemC in order to be highly configurable at compile time. Among others, parameters such as the number of ports, the phit size (word-length), the flit size, and the buffer sizes for the input and the stalled packets are configurable through a package file. Additionally, since our framework aims to support rapid evaluation of heterogeneous 3-D NoC systems, the proposed architectural solution, consisting of a mixture of 2-D and 3-D routers (described in SystemC), is fully synthesizable. This is also a key differentiation from relevant approaches that aim to automate the physical design of NoCs.

Fig. 3 gives the block diagram of the employed 3-D router. The input packets from the interface, as well as the corresponding packets from the attached node, are stored in a buffer of stalled packets. As only one packet is routed per clock cycle and output port, the buffer of stalled packets (SPB) contains those packets which could not be routed to the router's output ports. Even though a 2-D router has fewer ports (there are no ports for connectivity to upper/lower layers), its architectural organization is similar to the one discussed previously. The only differences between these two architectural instantiations concern the routing mechanism (since there is no connectivity to upper/lower layers) and the size of the SPB. Specifically, as the 2-D router has fewer output ports, a packet is more likely to be stalled in this router than in the corresponding 3-D implementation.
Hence, the SPB of the 2-D router is 30% larger than that of a 3-D router.

In addition, the SystemC model of our router incorporates a flexible Inter-Router Interface (IRI), which avoids manual architectural modifications in case of a non-valid routing direction (e.g. there is no node attached to a local port, or a router is spatially assigned to the periphery of a layer). The implementation of the IRI is based on a set of Read/Write C++ methods, one for each direction, which define the connectivity of the router's ports to the neighbor routers. From a physical point of view, the IRI provides the appropriate pins for communication with the neighbor routers, marked as X_LEFT, X_RIGHT, Y_UP, Y_DOWN, as well as Z_ABOVE and Z_BELOW (in the case of a 3-D router). If a router is not attached to a neighbor router in a given direction, the respective method invalidates the corresponding Read/Write operations.

Fig. 1. Example for an application's communication graph.
Fig. 2. Alternative 3-D NoCs for the example discussed in Fig. 1: panels (a) and (b).
Fig. 3. Architectural template for a 3-D router.

The selection of the output direction for an incoming packet is made by the Routing Decision Mechanism, based on a modified ZXY algorithm.
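As a point of reference, plain (unmodified) ZXY dimension-ordered routing resolves the offset along the z-axis first, then x, then y. The sketch below illustrates only this baseline rule; it is not the routers' SystemC implementation, and the modified variant used in this work is not reproduced. The port names are those of the IRI pins, but the mapping of each name to an axis direction (e.g. X_RIGHT meaning increasing x) and the helper names `zxy_next_port` and `route` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int, int]  # (x, y, z) router co-ordinates

# Assumed mapping of IRI port names to unit steps on the 3-D grid.
STEP = {
    "X_LEFT": (-1, 0, 0), "X_RIGHT": (1, 0, 0),
    "Y_DOWN": (0, -1, 0), "Y_UP": (0, 1, 0),
    "Z_BELOW": (0, 0, -1), "Z_ABOVE": (0, 0, 1),
}


def zxy_next_port(current: Coord, dest: Coord) -> Optional[str]:
    """Pick the output port under plain ZXY dimension-ordered routing.

    Returns None when the packet is already at its destination router.
    """
    cx, cy, cz = current
    dx, dy, dz = dest
    if dz != cz:  # resolve the vertical offset first
        return "Z_ABOVE" if dz > cz else "Z_BELOW"
    if dx != cx:  # then the x offset
        return "X_RIGHT" if dx > cx else "X_LEFT"
    if dy != cy:  # finally the y offset
        return "Y_UP" if dy > cy else "Y_DOWN"
    return None


def route(src: Coord, dest: Coord) -> List[str]:
    """List the ports a packet takes from src to dest on an uncongested mesh."""
    ports: List[str] = []
    current = src
    while True:
        port = zxy_next_port(current, dest)
        if port is None:
            return ports
        ports.append(port)
        sx, sy, sz = STEP[port]
        current = (current[0] + sx, current[1] + sy, current[2] + sz)


if __name__ == "__main__":
    print(route((0, 0, 0), (1, 2, 1)))
```

On an uncongested uniform mesh the path length equals the Manhattan distance between source and destination. In a heterogeneous NoC a 2-D router has no vertical ports, which is one reason a plain ZXY rule cannot be applied unchanged.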
According to the modified ZXY algorithm, priority is given to the stalled packets that are stored in the router's local SPB. Since the size of this buffer is limited, the routing algorithm incorporates a mechanism for preventing SPB overflow. More specifically, if a packet cannot be routed in the direction with the minimum Manhattan distance and the SPB is full, then the packet is routed to the first unoccupied direction (port) in order to avoid data loss. Otherwise (i.e. when unoccupied output ports are available), routing along the z-axis has higher priority than routing inside the same layer. Additionally, the employed routing algorithm avoids both livelocks and deadlocks. In more detail, deadlocks are avoided by incorporating the Turn Model [22]. Livelocks may occur in networks with non-minimal routing algorithms, i.e. where packets may follow paths that do not always lead them closer to the destination. For the scope of this paper, the employed routers avoid livelocks by prioritizing traffic based on hop counters. For each node a packet traverses, a hop counter is incremented. If several packets request the same channel, the one with the largest hop-counter value is granted access. This way, packets that have circled the network for a long time receive higher priority and eventually reach their destination. Finally, in order to declare the spatial assignment of each router across the grid of the 3-D NoC, we also define its co-ordinates as a tuple ⟨X, Y, Z⟩. The co-ordinates of the destination router are also found in the header of each packet, as depicted in Fig. 4. We have to mention that routing algorithms more flexible than the employed modified ZXY algorithm can also be supported, if their functionality is appropriately included in the router's SystemC description.

3. Proposed methodology

This section describes the proposed methodology for performing rapid evaluation of heterogeneous 3-D NoCs. The steps of this methodology are depicted schematically in Fig. 5. Initially, a two-step profiling approach is applied to the application's high-level description (e.g. C/C++) for determining the communication bandwidth among functionalities mapped onto different nodes. More specifically, the first step is an algorithmic analysis of the target application (performed manually) that determines the amount of data exchanged between the application's kernels per function call, whereas the second step is based on software tools (in our case, the Callgrind tool from the Valgrind suite [1]) and finds out how many times each function is called. Then, it is possible to calculate with acceptable accuracy the total amount of data sent/received between the application's functionalities.

The outcome of this analysis, in conjunction with the system specifications (e.g. desired throughput, maximum affordable power/energy dissipation, etc.), is fed as input to the topology exploration tool. At this step, different heterogeneous 3-D NoCs are evaluated in order to derive those topologies that belong to the Pareto-optimal curve. Additionally, as intra-router communication is far more efficient than the corresponding inter-router links, our framework supports the clustering of multiple nodes into a single router.

The exploration framework takes into account a number of technology-oriented issues, such as the limited number of available TSVs (an array of TSVs occupies a significant part of the useful silicon area [3]) and the maximum number of layers for the target 3-D platform. By making the exploration tool aware of these constraints, it is possible to retrieve only the technologically viable solutions. The topology exploration procedure is automated with a software tool initially proposed in [17].
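The two profiling steps and the Pareto selection described above reduce to simple arithmetic, which the following minimal sketch illustrates. It is not the exploration tool of [17]. It assumes that the per-call data volume of a link is multiplied by the call count of the consuming function, and that candidate topologies are compared on two cost metrics where lower is better (delay and power are used here for illustration). The kernel names, topology names, numbers, and the helpers `total_traffic` and `pareto_front` are all hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]                 # (producer kernel, consumer kernel)
Candidate = Tuple[str, float, float]   # (topology name, delay, power)


def total_traffic(bytes_per_call: Dict[Edge, int],
                  call_counts: Dict[str, int]) -> Dict[Edge, int]:
    """Total bytes per link: data per call (step 1) times call count (step 2).

    Assumption: the relevant count is that of the consuming function.
    """
    return {edge: nbytes * call_counts[edge[1]]
            for edge, nbytes in bytes_per_call.items()}


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates.

    q dominates p if q is no worse in both metrics and strictly better in one.
    """
    front = []
    for p in candidates:
        dominated = any(
            q[1] <= p[1] and q[2] <= p[2] and (q[1] < p[1] or q[2] < p[2])
            for q in candidates
        )
        if not dominated:
            front.append(p)
    return front


if __name__ == "__main__":
    per_call = {("fft", "filter"): 1024, ("filter", "quant"): 256}
    calls = {"filter": 100, "quant": 40}
    print(total_traffic(per_call, calls))

    topologies = [("A", 10.0, 5.0), ("B", 8.0, 7.0), ("C", 12.0, 6.0)]
    print(pareto_front(topologies))
```

In this toy input, topology C is slower and more power-hungry than A, so only A and B remain on the front; neither of those two is better than the other in both metrics.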
For the scope of this work, however, we have extensively modified the exploration algorithm in order to take into account the additional connectivity constraints posed by heterogeneous 3-D NoCs (instead of the uniform 3-D NoCs supported by the original version). The derived topology defines the spatial location of each router over the NoC's grid, as well as the type of each router (either 2-D, 3-D with connectivity only to the upper layer, 3-D with connectivity only to the lower layer, or fully 3-D). By appropriately combining and interconnecting a number of 2-D and 3-D routers, our framework automatically derives a synthesizable SystemC model, which describes the heterogeneous 3-D NoC.

After deriving the description of the NoC's architecture, we proceed with the physical implementation on the selected 3-D technology. First of all, the SystemC model is translated to a synthesizable Register Transfer Level (RTL) description with a High-Level Synthesis (HLS) tool. In this paper, the HLS task is automated with the C-to-Silicon compiler [16]. The remaining steps of the introduced methodology deal with the architecture's synthesis, as well as the quantification of numerous performance metrics for the target 3-D system implementation. Due to the importance of this step, the next section provides additional details about how we quantify a 3-D design with existing 2-D Cadence tools. This feature is one of the major contributions discussed throughout this paper, since the usage of commercial tools makes the derived results much more accurate than those of similar approaches, which are usually based on simplified models.

4. 3-D integration

This section describes the introduced framework for performing rapid evaluation of 3-D NoCs. During this step all the platform-dependent decisions are made. Among others, these decisions include the way the system's architecture is partitioned, the IP block to layer assignment, the selection of the interlayer interconnection technology, etc. By distinguishing these platform-dependent decisions from the pure physical prototyping step, we can support the design of 3-D stacks comprising heterogeneous layers. Additionally, the introduced 3-D integration task consists of modular steps in order to enable interaction with tools from similar and/or complementary flows. More specifically, the steps of this framework are summarized as follows:

- Pre-processing step: Verifies the functional integrity of the design and extracts its XML description.
- 3-D stack generation: Generates the 3-D stack and determines the communication (connectivity) among layers.
- 3-D system prototyping: Performs the physical implementation of the 3-D NoC and evaluates the derived solution.

Fig. 4. Structure of packets for our proposed NoC architecture.

4.1. Pre-processing step

The first step in our methodology is depicted schematically in Fig. 6. Initially, the architecture's SystemC description is simulated under various parameters and constraints (e.g. clock period, on-chip memories organization) in order to verify the system's functionality. Then, we determine the desired hierarchy for the target 3-D architecture. Our framework can handle different levels of hierarchy, each of which has advantages and disadvantages. For instance, a block-based system description leads to a coarse-grain solution, whereas a gate-level netlist comes with a finer system implementation. In other words, the fine-grain approach offers the highest performance enhancement, but it also introduces the maximum computational complexity for performing architecture-level exploration.
Regarding the 3-D stack discussed throughout this paper, we maintain the system's hierarchy between routers, while each of them is flattened in order to maximize timing and power savings. These gains are feasible since the employed selection takes into consideration the regularity imposed by the underlying mesh topology.

After defining the SoC's hierarchy, the SystemC description is synthesized with Design Compiler. As long as the design constraints (e.g. timing slacks, DRCs, etc.) are met, the output of synthesis is translated to an equivalent XML description. This task is supported by our new publicly available software tool, named Net2XML.

4.2. 3-D stack generation

The XML description derived from the previous step represents the SoC's netlist after technology synthesis. This description is fed as input to the second step of our proposed framework, depicted schematically in Fig. 7.

Initially, the application is partitioned into a number of subsets. The goal at this level is to minimize the connections between partitions, while respecting some constraints, like keeping DRAM and logic on different partitions. Previous studies showed that the partitioning algorithm exhibits increased flexibility whenever the number of subsets is higher than the number of device layers [4,8]. The derived subsets are then assigned to the layers of the target 3-D architecture. During this task, both fabrication and cost parameters are taken into account. More specifically, for a given layer, only technology-compatible

Fig. 5. Proposed methodology for evaluating 3-D NoCs.
Fig. 6. Tasks for the pre-processing step. Input: system's RTL description. Tasks: simulation; synthesis (Design Compiler); hypergraph extraction from design (Net2XML). Output: functional integrity for the NoC; system's XML description.
Fig. 7. Tasks for 3-D stack generation. Input: number of layers; 3-D bonding technology. Tasks (grouped in the figure under partitioning, stack generation, and additional improvements): system partitioning (Tabu algorithm); partition to layer assignment (Tabu algorithm); layer ordering (Tabu algorithm); assign TSVs to buses (XML2Net); update the RLC values for TSV networks (XML2Net); form TSV networks (XML2Net). Output: 3-D stack; communication among layers.
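The partitioning goal stated in Section 4.2 (few connections between partitions) can be expressed as a cut-size count over the netlist hypergraph. The sketch below shows only how such an objective might be evaluated for one candidate assignment; it is not the Tabu algorithm of the framework, and it ignores the side constraints (e.g. keeping DRAM and logic apart). The data layout (each net as a list of cell names), the cell names, and the helper `cut_size` are assumptions made for illustration.

```python
from typing import Dict, List


def cut_size(nets: List[List[str]], partition_of: Dict[str, int]) -> int:
    """Count the nets whose cells are spread over more than one partition."""
    return sum(
        1 for cells in nets
        if len({partition_of[cell] for cell in cells}) > 1
    )


if __name__ == "__main__":
    # Hypothetical netlist: three nets connecting four router blocks.
    nets = [["r0", "r1"], ["r1", "r2", "r3"], ["r2", "r3"]]

    grouped = {"r0": 0, "r1": 0, "r2": 1, "r3": 1}
    scattered = {"r0": 0, "r1": 1, "r2": 0, "r3": 1}

    print(cut_size(nets, grouped))    # only the middle net is cut
    print(cut_size(nets, scattered))  # every net is cut
```

A search procedure such as Tabu search would repeatedly move cells between partitions and keep the moves that lower this count, subject to the constraints it is given.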