
A General Framework for Service Availability for Bandwidth-Efficient Connection-Oriented Networks

IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 18, NO. 3, JUNE 2010

Ori Gerstel, Fellow, IEEE, and G. Sasaki

Abstract: Availability in connection-oriented service in networks has traditionally been “all-or-nothing,” i.e., when a failure occurs, a connection either is unprotected or fully protected. The differences in availability and cost between these two extremes can be quite high. A general framework for service availability will be presented that fills the gap. It is shown how network resources and cost are related to service parameters of the framework for networks that are bandwidth-efficient. In addition, a simple revenue model is presented and characterized, revealing when nontraditional service agreements may be attractive.

Index Terms: Protection switching, service availability, service level agreements, survivable networks.

I. INTRODUCTION

Most protection schemes in a network attempt to achieve the availability specified in a customer’s service level agreement¹ (SLA) by “all-or-nothing” switching, i.e., whenever a fault occurs, a connection on the fault is either completely protected or not protected at all for the duration of the fault. There are great differences in availability and cost between these two extremes. For example, a protected connection may have a high availability of 99.999%, while an unprotected connection could have a much lower availability of 99.9% or even lower. In addition, a protected connection may use more than twice the network resources of an unprotected connection, since the disjoint working and protection paths of a connection are together at least twice as long as a shortest path of an unprotected connection. While shared protection schemes reduce resource usage, they still require significant protection resources, especially for sparse topologies such as rings, and do not provide availability guarantees for connections that are not 99.999% protected.
So among the limited selection of classical protection services, there is a high tradeoff between availability and network cost, which ultimately affects customer prices. What is needed is to bridge the gap between these classical protection services so that a customer will be able to find the right SLA at the right price.

Manuscript received March 09, 2008; revised November 06, 2008; February 21, 2009; and August 13, 2009; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor A. Somani. First published April 19, 2010; current version published June 16, 2010.
O. Gerstel is with Carrier Routing Business Unit, Cisco Systems, Natanya 42504, Israel (e-mail: ori@ieee.org).
G. H. Sasaki is with the University of Hawaii, Honolulu, HI 96822 USA (e-mail: galens@hawaii.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNET.2010.2046746

¹ For sake of simplicity, we refer to the availability defined in the SLA as “the SLA.” The SLA contains many other aspects, such as maximum latency, jitter, as well as support guarantees, but they are outside the scope of this paper.

The purpose of this paper is to propose a general framework for the SLA toward this goal while keeping in mind the underlying technologies for implementation. The framework has the following practical advantages. It can be implemented with current and foreseeable technologies at both the optical layer and electronic packet switched layer (e.g., Ethernet, MPLS, IP). Its parameters can be measured to verify SLA compliance. It leads to bandwidth efficiency in the network, which in this paper means that there is minimal or possibly no additional bandwidth beyond the necessary working bandwidth. This is important to keep connection prices low.

The framework allows protection bandwidth to be a fraction of the working bandwidth such as in [1] and [2].
It also allows connections that are not directly on a fault to be interrupted to free surviving bandwidth for protection such as in [3] and [4]. This is a departure from the usual telecommunication practice of not disturbing established connections, but it allows greater flexibility in optimizing surviving bandwidth, leading to a more holistic network protection:

Definition: Network protection is a means to redistribute the limited bandwidth that survived a network failure, among all the services supported by the network with the single goal of ensuring they all meet their SLAs.

In addition, the framework will introduce features that address the following weakness of conventional availability specifications. The availability of a connection is typically specified by a percentage, e.g., 99.9%. For an operating period of, say, a year, the connection will be unavailable for at most 8 hours and 46 min (0.1% of a year). Note that the connection can be continuously down for 8 hours and 46 min and still meet its SLA. This may be too long for a customer, who may prefer to limit any continuous downtime to a couple of hours, and spread out the downtimes. The SLA framework of this paper addresses this by ensuring availability over short periods.

The paper is organized as follows. Related work is discussed in Section II, and the SLA framework is presented in Section III. Section IV describes how the SLA framework affects network cost, and in particular the required link bandwidths. In that section, and throughout this paper, the system is assumed to be composed of two network nodes connected by a pair of connections that pass through a network as shown in Fig. 1. The connections basically serve as a pair of links between the two nodes, and they will be referred to as “links” 1 and 2. The “links” are assumed to have the same bandwidth and disjoint physical paths, and it is assumed that they do not fail together. It will be assumed throughout the paper that the total time that link is down is at most .
In addition, the time to repair a link is at most . The values , and are assumed to be known a priori. Presumably, at least conservative estimates are known by service providers. Throughout the paper, the following notation will be used: . Note that is an upper bound on the total amount of time that there is some failure. It will also be assumed that the links carry a total of connections, and each connection operates over the time interval . The duration will be referred to as the lifetime of the network. In Section IV, a lower bound on link bandwidth requirements is given. It will be shown that the lower bound is nearly achievable for simple but important special cases.

Fig. 1. Point-to-point system through a network.

Section V presents scenarios when nontraditional protection schemes may be economically attractive. Simple economic models are used to illustrate the tradeoffs. Implementation is discussed in Section VI. It is worth noting that the network must now keep track of state information per connection. Section VII has final remarks. It includes directions for future research and a discussion of generalizations of the assumption shown in Fig. 1 from two links to multiple links.

1063-6692/$26.00 © 2010 IEEE

II. RELATED WORK

There have been a number of proposals to bridge the gap between full protection and unprotected service. Several papers, such as [5], propose that connections be given priorities based on their SLA, and survivability of a connection depends on its relative priority among the other connections. Such “best effort” approaches that depend on the SLA of other connections may be unsuitable for applications that require service guarantees.

The Quality of Protection (QoP) framework in [1] is an SLA framework with survivability guarantees that are independent of other connections.
Here, the availability of connections is accommodated in one of two ways: 1) connections are given the amount of bandwidth they were promised under a failure condition, even if it is only a fraction of the bandwidth under normal conditions; 2) a probabilistic scheme in which connections get their full bandwidth according to a probability that is based on their priority. Note that the first way of the QoP framework is also considered in [2]. Also note that the second way of the QoP framework is an “all-or-nothing” protection switching. Another probabilistic approach is presented in [6], which describes a mathematical optimization framework. In [7], routing under a probabilistic framework is considered.

In [3] and [4], connections that are not on a fault can be interrupted or “victimized,” allowing greater flexibility to meet SLAs of all connections. In the case of [3], the connections protect a fraction of their working bandwidth. This approach can be viewed as a generalization of the “extra traffic” concept in SONET. Our approach also blurs the boundary between working and protection bandwidth, as ultimately the system can use both to protect connections. It should be noted that the selection of which connections to victimize must be done carefully to ensure SLAs are satisfied.

In [4], the accumulated downtimes of connections are tracked and used to determine which connections to victimize. It is shown in [4] that naive greedy approaches can fail. Accumulated downtimes of connections are also used in [8]. In [9], it is discussed that the guaranteed maximum accumulated downtime may need a safety margin to take into account the time to repair.

III. SLA FRAMEWORK

The SLA framework for a connection will be described. The framework has components referred to as SLA1–SLA4.

SLA1: The connection has two states, working and protection, corresponding to two bandwidths, the working state bandwidth and the protection state bandwidth , respectively.² The working state is the normal state for the connection, and the protection state occurs only when there is a fault somewhere in the network. So when there are no faults in the network, all connections are in their working state, and if there is a fault, then some connections may be in the protection state, and the rest of the connections are in the working state. The set of connections that are in the protection state may change over time during a fault. The protection state bandwidth is a fraction of and can be zero. The following simplifying assumption will be used in the subsequent sections. For all connections , the working state bandwidth is , and the protection state bandwidth is .

SLA2: The connection has a maximum accumulated time that the connection is in the protection state. Note that the service availability of the connection must be at least , where is the lifetime of the network. For example, if the availability is 99.999% and is a year, then is 5.26 min. will be referred to as the maximum accumulated protection state time for connection .

SLA1 and SLA2 cover classical connection services: unprotected, fully protected, and low-priority preemptible (extra traffic in SONET nomenclature) services. For unprotected connections , where is the link on which connection is normally carried. For fully protected connections (i.e., availability is 100%), . For low-priority preemptible connections, , since any failure will impact these connections due to their preemption, even if they were not directly impacted. SLA1 and SLA2 also cover protection schemes of [1]–[3] and [4].

The next two SLA components, SLA3 and SLA4, specify the availability of a connection over short time periods.

SLA3: Whenever the connection goes into the working state, it must remain in the state for at least a minimum amount of time before going to the protection state. This ensures that services that require the working state bandwidth have sufficient time to be completed. For example, video streams for movies
For example, video streams for movies 2 Such an SLA is feasible if connection interfaces can transmit at two band-width rates. Section VI presents more implementation details.  GERSTEL AND SASAKI: FRAMEWORK FOR SERVICE AVAILABILITY FOR BANDWIDTH-EFFICIENT CONNECTION-ORIENTED NETWORKS 987 require bandwidth for 60 to 90 min. The parameter will bereferred to as the  minimum working state duration .The following assumption will be used in the next sectionto ensure SLA3 with minimal link bandwidth. Let the time be-fore the first failure and the times between consecutive faultsbe referred to as  fault-free periods . Assume that the  minimum fault-free duration  is at least , where is the minimumworking state duration for connection . Note that this assump-tion should be reasonable if the minimum working state dura-tionsaremuchsmallerthanthemeantimetofailureforthelinks.Without this assumption, a link could go down, come back up,and then immediately go back down again, possibly leading toa violation of SLA3. For example, suppose each link is 30 Gb/sand carries three connections, where the working state band-width is Gb/s and the protection state bandwidth isGb/s. Note that the links have minimum bandwidth forthe connections in their working state, and if a link goes down,thenthesurvivinglinkhasjustenoughbandwidthforallconnec-tions in their protection state. Now when one of the links goesdown, all six connections go into their protection states, eachwith 5Gb/s onthesurvivinglink.When thelinkcomesback up,all connections are required to be in their working state. How-ever, when the link immediately goes back down again, there isnot enough surviving bandwidth to ensure all connections canbe in their working state for their minimum working state dura-tions, violating SLA3. To ensure SLA3, more link bandwidth isneeded, but then the links are less bandwidth-efficient. 
SLA4: Whenever the connection goes into the protection state, it must transition to the working state after at most amount of time. The parameter will be referred to as the maximum protection state duration. In addition, over any time period , the amount of time that the connection is in the working state is at least , where is a parameter satisfying and referred to as the short-term availability rate. Note that this has the same form as the quality-of-service guarantee definition in [10].

This ensures that working state bandwidth will resume within a tolerable prescribed delay . An example application is rescheduling a video conference meeting within a couple of hours. It also ensures that the connection has access to working state bandwidth for a fraction of time that is approximately during faults, and can be chosen high enough to provide good average throughput. An example application that needs good throughput is offline backup.

The next lemma shows how the maximum protection state duration and minimum working state duration imply an access rate to working state bandwidth.

Lemma 1: Suppose a connection has a minimum working state duration and a maximum protection state duration . Then, during any interval , the amount of time the connection is in the working state is at least .

The proof of the lemma is given in Appendix A. From the lemma, it can be assumed without loss of generality that . To achieve bandwidth efficiency, in many cases, the value of must be strictly larger than .

For example, suppose in Fig. 1 there are two connections, 1 and 2, that are on links 1 and 2, respectively, when there are no faults. Suppose they have working state bandwidth and protection state bandwidth 0. Suppose connection 1 has hour, and connection 2 has hours. Suppose , so whenever link 2 has a fault, the connections are in the working state for approximately 50% of the time.
However, with these values of and , if there is a fault on link 2 for say 8 hours, then connection 2 will be in the working state for at least 4 hours, during which connection 1 will also be in the working state at some time. Since both connections will be in the working state on link 1 at the same time, link 1 must have bandwidth 2 . So even though the two connections each only require average bandwidth on link 1 during the fault, the link must have bandwidth 2 , and the link will be utilized at only about 50%.

Note that subsets of the SLA components can be disabled by choosing appropriate parameter values. For example, to disable SLA1, SLA2, SLA3, or SLA4, the parameters can be chosen so that , or , respectively.

The following are mild assumptions on the SLA parameters to simplify results in Section IV.

Assumption 1: Without loss of generality, each connection k is assumed to satisfy: (for SLA2); (for SLA4); and (for SLA4).

A. Examples of Service Mixes

To add some intuition to the variety of options that our general SLA framework provides, this example presents possible service mixes for Fig. 1 when each link is 30 Gb/s, and each link has three 10-Gb/s connections, so there are six connections altogether. The following are services that could be supported by the system.

Service Mix 1: All six connections are unprotected.

Service Mix 2: Three of the connections are fully protected, and the other three are low-priority. This is a classical SONET scenario with extra traffic.

Service Mix 3: Whenever a fault occurs, all six connections have protection state bandwidth Gb/s. This scenario would apply to a real-time application that can fall back to a degraded performance at a certain reduced data rate. An example is given in Section VI of connections carrying high-definition TV (HDTV) video, but when there is a fault, the connections reduce their bandwidth to carry standard-definition TV (SDTV) video.
Service Mix 4: Whenever a fault occurs, half of the connections are at 10 Gb/s, and the other half have no bandwidth. The connections take turns at having the working state bandwidth of 10 Gb/s and switch every 2 hours. This is a “rolling blackout” strategy to share the surviving bandwidth. This corresponds to Proposition 4 in Section IV, where hours, , and a parameter of the proposition is equal to 2. This scenario corresponds to a nonreal-time application, such as data center backup during off hours.

Service Mix 5: Whenever a fault occurs, four connections have protection state bandwidth of 2.5 Gb/s, and two of the connections have working state bandwidth of 10 Gb/s. The connections take turns at having the working state bandwidth of 10 Gb/s every 2 hours. This corresponds to Proposition 4 in Section IV,
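
The turn-taking in Service Mixes 4 and 5 can be illustrated with a simple round-robin schedule. The sketch below is an assumption made for illustration only: the paper's actual construction is the one in Proposition 4, which is not reproduced here, and the function names, the connection numbering 0 to 5, and the 8-hour fault length are all hypothetical.

```python
def rolling_blackout(num_connections, num_working, slot_hours, fault_hours):
    """Round-robin schedule (an assumed policy, not the paper's Proposition 4).

    Returns a list of (start_hour, working_ids) pairs covering the fault,
    where working_ids are the connections holding working state bandwidth
    during the slot that begins at start_hour.
    """
    schedule = []
    start = 0
    slot = 0
    while start < fault_hours:
        first = (slot * num_working) % num_connections
        working = sorted((first + j) % num_connections for j in range(num_working))
        schedule.append((start, working))
        start += slot_hours
        slot += 1
    return schedule


def slot_load_gbps(working_ids, num_connections, working_gbps, protection_gbps):
    """Total bandwidth used on the surviving link during one slot."""
    n_work = len(working_ids)
    return n_work * working_gbps + (num_connections - n_work) * protection_gbps


# Service Mix 4: three of six connections at 10 Gb/s, the rest at 0 Gb/s.
mix4 = rolling_blackout(6, 3, 2, 8)
# Service Mix 5: two of six connections at 10 Gb/s, the rest at 2.5 Gb/s.
mix5 = rolling_blackout(6, 2, 2, 8)

mix4_loads = [slot_load_gbps(ids, 6, 10, 0) for _, ids in mix4]
mix5_loads = [slot_load_gbps(ids, 6, 10, 2.5) for _, ids in mix5]
```

In both mixes every 2-hour slot loads the surviving link to exactly 30 Gb/s. Under this round-robin assumption a connection in Service Mix 4 alternates 2 hours in each state, while in Service Mix 5 it spends 2 hours in the working state followed by 4 hours in the protection state.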