A New Cost-Effective Technique for QoS Support in Clusters

A. Martínez, F.J. Alfaro, J.L. Sánchez, F.J. Quiles
Computer Systems Department, Univ. of Castilla-La Mancha, 02071 - Albacete, Spain
{alejandro,falfaro,jsanchez,paco}@dsi.uclm.es

J. Duato
Dep. of Computer Engineering, Tech. Univ. of Valencia, 46071 - Valencia, Spain
jduato@disca.upv.es

Abstract: Virtual channels (VCs) are a popular solution for the provision of quality of service (QoS). Current interconnect standards propose 16 or even more VCs for this purpose. However, most implementations do not offer so many VCs because it is too expensive in terms of silicon area. Therefore, a reduction of the number of VCs necessary to support QoS can be very helpful in the switch design and implementation.

In this paper, we show that this number of VCs can be reduced if the system is considered as a whole rather than each element being taken separately. The scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. In this way, we obtain a noticeable reduction of silicon area, component count, and, thus, power consumption, and we can provide similar performance to a more complex architecture. We also show that this is a scalable technique, suitable for the foreseen demands of traffic.

Index Terms: Quality of Service, Switch Design, Scheduling, Virtual Channels, Clusters.

I. INTRODUCTION

The last decade has witnessed a vast increase in the variety of computing devices as well as in the number of users of those devices. In addition to the traditional desktop and laptop computers, new handheld devices like pocket PCs, personal digital assistants (PDAs), and cellular phones with multimedia capabilities have now become household names.

The information and services provided through the Internet rely on applications executed in many servers all around the world.
Many of those servers were originally based on personal computers (PCs), but the huge increase in the number of users worldwide quickly led to a dramatic increase in the number of clients concurrently accessing a given server. As a result, the computing power and I/O bandwidth offered by a single processor and a few disks were not enough to provide a reasonable response time.

This work was partly supported by the Spanish CICYT under CONSOLIDER CSD2006-46 and TIN2006-15516-C04-02 grants, by Junta de Comunidades de Castilla-La Mancha under grant PBC-05-005, and by the Spanish State Secretariat of Education and Universities under FPU grant.

Clusters of PCs emerged as a cost-effective platform to run those Internet applications and provide service to hundreds or thousands of concurrent users. Many of those applications are multimedia applications, which usually present bandwidth and/or latency requirements [1]. These are known as QoS requirements.

In the next section, we look at some proposals to provide QoS in clusters. Most of them incorporate 16 or even more VCs, devoting a different VC to each traffic class. This increases the switch complexity and required silicon area. Moreover, it seems that, when the technology enables it, the trend is to increase the number of ports instead of increasing the number of VCs per port [2].

In most of the recent switch designs, the buffers are the part that consumes the most silicon area (see [3] for a detailed design). The buffers at the ports are usually implemented with a memory space organized in logical queues. These queues consist of linked lists of packets, with pointers to manage them. Therefore, the complexity and cost of the switch heavily depend on the number of queues at the ports. For instance, the crossbar scheduler has to consider 8 times the number of queues if 8 VCs are implemented (greatly increasing the area and power consumed by this scheduler).
Thus, a reduction in the number of VCs (and in the required buffer space) necessary to support QoS can be very helpful in the switch design and implementation.

In this paper, we show that it is enough to use only two VCs at each switch port for the provision of QoS. One of these VCs would be used for QoS packets and the other for best-effort packets. We also explore a switch design that takes advantage of this reduction and we evaluate it with realistic traffic models. A preliminary version of this work can be found in [4].

Although using just two VCs is not a new idea, the novelty of our proposal lies in the fact that the global behavior of the network is very similar to that of a network with many more VCs. This is easily achieved by reusing at the switches the scheduling made at network interfaces (end-nodes).

Simulation results show that our proposal provides very similar performance to a traditional architecture with many more VCs, both for the QoS traffic and the best-effort traffic. Moreover, compared with a traditional architecture with only 2 VCs, our proposal provides a significant improvement in performance for the QoS traffic, while for the best-effort traffic the traditional design is unable to provide the slightest differentiation among packets of the same VC.

The remainder of this paper is structured as follows. In the following section the related work is presented. In Section III, we present in detail our strategy to provide QoS support with two VCs. In Sections IV and V, we study in depth the switch architecture we propose. The details on the experimental platform are presented in Section VI and the performance evaluation in Section VII. Finally, Section VIII summarizes the results of this study and identifies directions for future research.

II. RELATED WORK

The importance of network QoS is widely accepted by both the research community and the manufacturers. However, the problem is that existing networks are not so well prepared for the new demands.
Implementing QoS is still a very active research topic, with multiple possible solutions competing against each other. Depending on the network architecture, different techniques have to be taken into consideration. Much current research addresses the main aspects of QoS in different environments.

As mentioned earlier, the increasing use of the Internet and the appearance of new applications have been the dominant contributions to the need for QoS. For this reason, it is not surprising that most of the studies are focused on delivering QoS on the Internet [5], [6]. Many of the services available through the Internet are provided by applications running on clusters. Therefore, researchers are also proposing mechanisms for providing QoS on these platforms, as we will show later.

More recently, with the advent of different types of wireless technologies, wireless devices are becoming increasingly popular for providing users with Internet access. They can transmit not only data but also voice, or run multimedia applications, for which QoS support is essential. The QoS mechanisms proposed for wired networks are not directly applicable to wireless networks, and therefore, specific approaches have been proposed [7], [8].

Therefore, QoS is a very interesting topic in network design, in all of its forms. Our proposal focuses on cluster interconnects and, thus, we concentrate on the work most closely related to the proposal in this paper. During the last decade several cluster switch designs with QoS support have been proposed. Next, we review some of the most important proposals.

Multimedia Router (MMR) [9] is a hybrid router that uses pipelined circuit switching for multimedia traffic and virtual cut-through for best-effort traffic. Pipelined circuit switching is connection-oriented and needs one VC per connection.
This is the main drawback of the proposal because the number of VCs per physical link is limited by the available buffer size and there may not be enough VCs for all the possible existing connections (in the order of hundreds). Therefore, the number of multimedia flows allowed is limited by the number of VCs. Moreover, the scheduling among hundreds of VCs is a complex task.

MediaWorm [10] was proposed to provide QoS in a wormhole router. It uses a refined version of the Virtual Clock algorithm [11] to schedule the existing VCs. These VCs are divided into two groups: one for best-effort traffic and the other for real-time traffic. Several flows can share a VC, but 16 VCs are still needed to provide QoS. Moreover, it is well known that wormhole is more likely to produce congestion than virtual cut-through [12]. In [13], the authors propose a preemption mechanism to enhance MediaWorm performance, but in our view that is a rather complex solution.

InfiniBand was proposed in 1999 by some of the most important IT companies to provide present and future server systems with the required levels of reliability, availability, performance, scalability, and QoS [14]. Specifically, InfiniBand Architecture (IBA) proposes three main mechanisms to provide the applications with QoS. These are traffic segregation with service levels, the use of VCs (IBA ports can have up to 16 VCs), and the arbitration at output ports according to an arbitration table. Although IBA does not specify how these mechanisms should be used, some proposals have been made to provide applications with QoS in InfiniBand networks [15].

Finally, PCI Express Advanced Switching (AS) architecture is the natural evolution of the traditional PCI bus [16]. It defines a switch fabric architecture that supports high availability, performance, reliability, and QoS. AS ports incorporate up to 20 VCs that are scheduled according to some QoS criteria.

All the technologies studied propose a significant number of VCs to provide QoS support.
However, implementing a great number of VCs would require a significant fraction of silicon area and would make packet processing slower. Moreover, in all cases, the VCs are used to segregate the different traffic classes. Therefore, it is not possible to use the available VCs to provide other functionalities, like adaptive routing or fault tolerance, when all VCs are used to provide QoS support.

On the other hand, there have been proposals which use only two VCs in communication networks, guaranteeing bandwidth for the premium traffic and also allowing regular traffic. For instance, McKeown et al. [17] proposed a switch architecture for ATM with these characteristics. Avici TSR [18], proposed by Dally et al., is also a well-known example of this. In these cases, the network is able to segregate premium traffic from regular traffic. However, this design is limited to this classification and cannot differentiate among more categories. The recent IEEE standards recommend considering up to seven traffic classes [19]. Therefore, although the ability to differentiate two categories is a great improvement, it could still be insufficient.

In contrast, the novelty of our proposal lies in the fact that, although we use only two VCs in the switches, the global behavior of the network is very similar to that obtained using many more VCs. This is because we are reusing at the switch ports the scheduling decisions performed at the network interfaces, which have as many VCs as traffic classes (8 VCs in our performance evaluation). As we will see, the network provides a differentiated service to all the traffic classes considered.

To the best of our knowledge, only Katevenis and his group have proposed something similar before [20]. The basic idea of their architecture is to map the multiple priority levels onto the two existing queues.
The mapping is such that the "lower queue" usually contains packets of the top-most non-empty priority level, while the "upper queue" is usually empty, thus being available for the high-priority packets that occasionally appear to quickly bypass the lower-priority traffic packets. The operation of the system is analogous to a two-lane highway, where cars drive in one lane and overtake using the other.

This is a promising idea that could be further developed. As it is now, it presents a serious problem: scalability. The proposal is a single switch that connects to all the line cards. However, if we need several switches, it is not trivial to handle all the signaling between the switches and the interfaces. Moreover, the proposal is tied to a very specific switch architecture, the buffered crossbar [21]. This provides a buffer for each combination of input and output, which in turn solves head-of-line blocking issues. However, this demands a lot of buffer space. Moreover, if several switches are connected, HOL blocking appears again and starvation could happen.

In summary, the most important proposals to provide QoS are based on the use of VCs. Most of them use 16 VCs and those which use only two are not able to handle all the recommended traffic categories. Implementing a large number of VCs would require a significant fraction of chip area and would make packet processing a more time-consuming task.

III. PROVIDING FULL QOS SUPPORT WITH ONLY TWO VCS

In [4], we proposed a new strategy that uses only two VCs at each switch port to provide QoS and achieves similar performance results to those obtained using many more VCs. In this section, we review the need for VCs to provide QoS, but at the same time, we justify that only two of them are enough to achieve this objective.

A. Motivation

In the following, we will analyze why supporting many VCs is not enough by itself to provide QoS. Moreover, we will see that it may have some negative effects.

In modern interconnection technologies, like InfiniBand or PCI AS, the obvious strategy to provide QoS support consists in providing each traffic class with a separate VC. Separate VCs allow the switches to schedule the traffic in such a way that packets with higher priority can overtake packets with lower priority. In this way, head-of-line (HOL) blocking between packets of different traffic classes is eliminated. Moreover, buffer hogging (a single traffic class taking all the available buffer space) is also avoided, since each traffic class has its own separate credit count.

However, VCs alone do not fully solve the aforementioned problems. There may be HOL blocking among the packets of a given VC if they are heading towards different destinations. This can be solved using virtual output queuing (VOQ) [18]. In this case, each input port has a queue per global destination of the network. However, this approach is generally an inefficient solution. A usual solution is to provide a separate queue per output of the switch. We will refer to this strategy as VOQ_SW. Although this does not completely eliminate HOL blocking, it is an intermediate compromise between performance and cost. In that case, the number of queues required at each input port would be the number of traffic classes multiplied by the number of output ports of the switch.

On the other hand, it is necessary to employ some kind of regulation on the traffic to provide strict guarantees on throughput and latency. Toward this end, a connection admission control (CAC) can guarantee that the load on no link will be higher than the available bandwidth.

Finally, scheduling at the switches alone is not enough to provide QoS; there must be some scheduling at the output of the network interfaces as well.
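The admission check and the VOQ_SW queue count described above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the CAC used in the paper: the `Link` class, the `admit` function, the bandwidth units, and the example figures are all hypothetical, and a real CAC would also check latency requirements.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical link model: fixed capacity plus the bandwidth already reserved."""
    capacity_mbps: float
    reserved_mbps: float = 0.0


def admit(path, requested_mbps):
    """Admit a connection only if no link on its path would be oversubscribed.

    On success the bandwidth is reserved on every link and True is returned.
    On failure no link is modified and False is returned.
    """
    if any(link.reserved_mbps + requested_mbps > link.capacity_mbps for link in path):
        return False
    for link in path:
        link.reserved_mbps += requested_mbps
    return True


def queues_per_input_port(traffic_classes, output_ports):
    """Queues needed at one input port with VOQ_SW and one VC per traffic class."""
    return traffic_classes * output_ports


if __name__ == "__main__":
    a, b = Link(8000), Link(8000)
    print(admit([a, b], 6000))           # both links have room
    print(admit([a], 3000))              # would exceed the capacity of link a
    print(queues_per_input_port(8, 16))  # 8 classes on a 16-port switch
    print(queues_per_input_port(2, 16))  # 2 VCs on the same switch
```

With 8 traffic classes on a 16-port switch the count is 128 queues per input port, against 32 when only two VCs are kept, which is the kind of reduction the text argues for.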
Thereby, the network interfaces also need to implement queues to separate the traffic classes.

Therefore, we can conclude that devoting a VC per traffic class at the switches is not enough to provide adequate QoS, and other techniques and mechanisms are necessary. More specifically, a CAC is needed to provide QoS support, network interfaces have to implement VCs, and at least VOQ_SW is needed to mitigate HOL blocking and buffer hogging.

On the other hand, we have observed that once the traffic is regulated using a CAC, it flows seamlessly through the network. Congestion, if any, only happens temporarily. Therefore, regulated traffic flows with short latencies through the fabric. Given these conditions, devoting a different VC to each traffic class might be redundant. Most of the problems the additional VCs address are already solved by the rest of the mechanisms. For instance, bandwidth guarantees are achieved because the CAC assures that no link is oversubscribed and the VOQ_SW mitigates HOL blocking.

Moreover, implementing a different VC per traffic class is not usually possible because final switch implementations do not incorporate so many VCs due to the associated cost in terms of silicon area.

Finally, let us talk about buffer requirements. When using store and forward or virtual cut-through switching, the minimum buffer space that is needed to achieve maximum throughput is one packet size plus a round-trip time (RTT) of data [22]. However, depending on the characteristics of traffic, like burstiness or locality, more memory at the buffers is necessary to obtain acceptable performance.

As we have mentioned before, VCs produce a static partition of buffer memory. That means that traffic of one VC cannot use space devoted to another VC, even if it is available. For that reason, although VCs provide traffic isolation, they may degrade overall performance under bursty traffic.

Figure 1 shows a small experiment that illustrates this. We inject 8 service levels of bursty traffic into a network.
We evaluate two alternatives, one with eight VCs (one per traffic class) and another with only two VCs. The total buffer space per port in both alternatives is the same; only the management changes (see footnote 1). In the plot, we see the average latency of the four top-most priority service levels. We can see that the two-VC design requires only 16 Kbytes of buffer per port to achieve the best performance, while the eight-VC alternative needs as much as 128 Kbytes per port to achieve the same performance.

[Figure 1: average latency (µs) versus buffer space (Kbytes) for the 8-VC and 2-VC designs; panels (a) Synthetic (self-similar) and (b) Multimedia (MPEG-4).]
Fig. 1. Average latency of QoS traffic with different buffer sizes per port (input load = 100% link capacity; 64 end-nodes MIN; 16-port switches).

B. New proposal

Our intention is to propose a switch architecture that uses only two VCs but still achieves traffic differentiation as if more VCs were used.

Based on all the previous observations, we propose that all the regulated traffic that arrives at a switch port uses the same VC. In this way, we put most of the effort of the QoS provision on the network interfaces and on a proper CAC strategy, keeping the switches as simple as possible. Using the CAC, we achieve two goals: first, we can guarantee throughput; second, as a consequence of the admission control, there is a bound on the latency of QoS packets. Note that the CAC and the complex network interfaces are needed anyway to provide strict QoS guarantees.

We need a second VC at the switches for unregulated traffic, which should also be supported. That VC is used by the best-effort traffic, which can suffer from congestion. In order to avoid any influence on the regulated traffic, we give the latter absolute priority over best-effort traffic.
Using just two VCs at the switches, and provided that the traffic is regulated, we will obtain very similar performance to that obtained with many more VCs.

Footnote 1: Packet size is tuned so there is always one packet size plus an RTT of buffer per VC. On the other hand, latency results are at message level, which in all the cases involves several packets.

At the end-points, there are schedulers that take into account the QoS requirements of traffic classes. Therefore, packets leaving the interfaces are ordered by the interface's scheduler. If packet i leaves earlier than packet i+1, it is because it was the best decision (with this scheduling strategy) to satisfy the QoS requirements of both packets, even if packet i+1 was not at the interface when packet i left. Therefore, we can assume that the order in which packets leave the interfaces is correct.

For the purposes of the switches, it is safe to assume that in all the cases packet i has higher priority than packet i+1. In this case, the switch is receiving at its input ports ordered flows of packets. Now, its task is analogous to a sorting algorithm: it inspects the first packet of each flow and chooses the one with the highest priority, building another ordered flow of packets.

Note that what "priority" means will depend on the actual scheduling at the network interfaces. For instance, if absolute priority between traffic classes is applied, then the scheduler at the switches has to consider the original priority of the packets at the head of the queues, instead of just whether they are regulated traffic or not. If, for instance, the switch has four ports, the scheduler looks at the first packet of the four buffers and chooses the one with the highest priority. This is not very complex because very efficient priority encoder circuits have been proposed [23].
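The head-of-queue selection described above can be sketched in a few lines. This is a simplified illustration, not the authors' switch scheduler: it models a single output port, the names `pick_next` and `drain` are hypothetical, and it assumes that a lower number means a more urgent packet and that ties are broken by the lowest port index.

```python
from collections import deque


def pick_next(qos_queues, best_effort_queues):
    """Return (vc_name, port) for the next packet to send, or None if all are empty.

    Each queue is a deque of (priority, packet_id) tuples in arrival order.
    Only the head of each queue is inspected. The QoS VC has absolute
    priority: best-effort heads are considered only when no QoS packet waits.
    """
    for vc_name, queues in (("qos", qos_queues), ("best_effort", best_effort_queues)):
        heads = [(q[0][0], port) for port, q in enumerate(queues) if q]
        if heads:
            _, port = min(heads)  # smallest priority value wins, then lowest port
            return vc_name, port
    return None


def drain(qos_queues, best_effort_queues):
    """Serve packets one by one until every queue is empty; return the send order."""
    order = []
    choice = pick_next(qos_queues, best_effort_queues)
    while choice is not None:
        vc_name, port = choice
        queues = qos_queues if vc_name == "qos" else best_effort_queues
        order.append(queues[port].popleft()[1])
        choice = pick_next(qos_queues, best_effort_queues)
    return order


if __name__ == "__main__":
    qos = [deque([(0, "a0"), (2, "a2")]), deque([(1, "b1"), (3, "b3")])]
    best_effort = [deque([(5, "x")]), deque()]
    print(drain(qos, best_effort))
```

Because each input flow is already ordered, looking only at the queue heads yields an ordered output flow, which is why the switch does not need one queue per traffic class for this step.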
Note that this scheduling cannot lead to starvation of the regulated traffic because the CAC assures that there is enough bandwidth for all the regulated flows.

Thereby, by using this scheduler, the switches achieve some reuse of the scheduling decisions made at network interfaces. This is because the order of the incoming messages is respected, but at the same time, the switches merge the flows to produce a correct order at the output ports. Note that a different scheduler, like round-robin or iSLIP, would not merge the packets in the best way and the latency of the packets with the highest priority would be affected.

A drawback of our technique is that the switches are not able to reschedule traffic as freely as they would be with a technique where a different VC for each traffic class were implemented. This problem is attenuated by the connection admission, because connections are only allowed if we can satisfy their bandwidth and latency requirements all along the path of packets. That means that the connections are established as if all the VCs were implemented at the switches and there were also the same schedulers as in the switches with all the VCs. In this way, we ensure that the required QoS load is feasible. We will not obtain exactly the same performance, but it will be very similar.

On the other hand, the best-effort traffic classes only receive coarse-grain QoS, since they are not regulated. However, the interfaces are still able to assign the available bandwidth to the highest priority best-effort traffic classes and, therefore, some differentiation is achieved among them. If stricter guarantees were needed by a particular best-effort flow, it should be classified as QoS traffic.
Therefore, although best-effort traffic can obtain better performance using more VCs, the results do not justify the higher expense.

Note that this proposal does not aim at achieving higher performance but, instead, at drastically reducing buffer requirements while achieving performance and behavior similar to those of systems with many more VCs. In this way, complete QoS support can be implemented at an affordable cost.

Summing up, our proposal consists in reducing the number of VCs at each switch port needed to provide flows with QoS. Instead of having a VC per traffic class, we propose to use only two VCs at switches: one for QoS packets and another for best-effort packets. In order for this strategy to work, we guarantee that no link is oversubscribed for QoS traffic by using a CAC strategy.

IV. SWITCH ARCHITECTURE

In this section, we describe in detail the proposed switch architecture. We study a 16-port, single-chip, virtual cut-through switch intended for clusters/SANs and for an 8 Gb/s line rate. We assume QoS support for distinguishing two traffic categories: QoS-requiring and best-effort traffic. Credit-based flow control is used to avoid buffer overflow at the neighbor switches and network interfaces. For the rest of the design constraints, like packet size, routing, etc., we take PCI AS [16] as a reference model.

The block diagram in Figure 2 (a) shows the switch organization. We consider a combined input/output queued (CIOQ) switch because it offers line rate scalability and good performance [24]. Moreover, it can be efficiently implemented in a single chip. This is necessary in order to offer the low cut-through latencies demanded by current applications. Moreover, this also allows some internal speed-up to be provided without the need for faster external links.

In the CIOQ architecture, output conflicts (several packets requesting the same output) are resolved by buffering the packets at the switch input ports.
Packets are transferred to the switch outputs through a crossbar whose configuration is synchronously updated by a central scheduler. To cope with the inefficiencies of the scheduler and packet segmentation overheads (see footnote 2), the

Footnote 2: Crossbars inherently operate on fixed-size cells and thus external packets are traditionally converted to such internal cells.