Benefits of Job Exchange between Autonomous Sites in Decentralized Computational Grids
Christian Grimme, Joachim Lepping, and Alexander Papaspyrou
Robotics Research Institute, Section Information Technology (IRF-IT)
Dortmund University of Technology, 44221 Dortmund, Germany

Abstract

This paper examines the job exchange between parallel compute sites in a decentralized Grid scenario. Here, the local scheduling system remains untouched and continues normal operation. In order to establish the collaboration and interaction between sites in a Grid context, a middleware layer that is responsible for the migration of jobs is supplemented. Independent users are assumed to submit their jobs to their site-local middleware layer, which in turn can request jobs for execution from alien sites. The simulation results are obtained using real workload traces and compared to the performance of the EASY Backfilling algorithm in an equal single-site scenario. It is shown that collaboration between sites is beneficial for all highly utilized participants, as it is possible to achieve shorter response times for jobs compared to the best single-site scheduling results.

1 Introduction

During the last decade, Computational Grids have become the de-facto standard for satisfying the ever-growing demand for High Performance Computing (HPC) power: although many Massive Parallel Processing (MPP) systems have been put into service, emerging problem domains such as Earth System Science and High Energy Physics [8] rely on extensive simulations, easily exhausting location-bound processing resources. Since the world-wide network infrastructures are becoming increasingly powerful, an almost instant migration of workload has nowadays become possible.

Regarding the architectural model of such Computational Grids, a manifold of proposals has been made: centralized and decentralized structure, hierarchical and equitable interaction, direct and indirect communication, and combinations of these.
Although all propositions have advantages and deficiencies, decentrally organized, equitable, and directly interacting architectures have become a trend in research [4], due to their natural assets such as fault tolerance, loose coupling, and high dynamics.

Especially the latter are highly important in most real-world scenarios: virtually all owners of MPP machines participating in a Computational Grid wish to retain full control over the management of their investment. All the more, information on the current or estimated future performance of the local system is usually considered sensitive and as such kept classified: many systems expose little more dynamic data than the number of currently free resources and the current queue length.

Naturally, this complicates the allocation of workload on the overlay level: Grid scheduling heuristics must expose an overall good performance under the assumption that their sole interface to the concrete resource is the local batch management system, which applies its own algorithms, typically tailored to the local user community's needs.

Here, we explore the potential of workload migration and exchange in such Grid environments. Following the aforementioned scenario, we assume a federation of independent MPP sites and apply on the Grid scheduling layer a decision policy that allows active requests for jobs from remote resources. On this basis, we conduct an empirical analysis of the performance of common objectives such as utilization and response time. For this purpose, we use workload trace data from real MPP installations.

The remainder of this paper is organized as follows: in Section 2, common scenarios are reviewed and the used approach is motivated. Then, in Section 3, an overview of the surveyed setup with its topology, communication model, and migration strategy is given. Section 4 comprises an explanation of applied performance objectives.
Following, in Section 5, the experimental setup is introduced and the simulation results are discussed. Finally, in Section 6, conclusions are drawn and a prospect for future work is made.

Eighth IEEE International Symposium on Cluster Computing and the Grid, 978-0-7695-3156-4/08 $25.00 © 2008 IEEE, DOI 10.1109/CCGRID.2008.5525

2 Background & Classification

Currently, a plethora of projects and installations of Computational Grids in both scientific and enterprise domains are available. While all being different in their requirements, use cases, and middlewares, there are common aspects to be identified on the architectural and business level.

In general, most existing prototype Grid installations comprise a federation of partners that share a common goal, such as a specific project, and pool their MPP resources. However, each participating site runs its own scheduling system, which is driven by certain provider-specific policies and rules regarding the planning scheme for incoming workload. The schedulers themselves are often incapable of interacting with schedulers from a different administrative domain, even within the same project.

Regardless of whether users in such a project have to decide on the distribution of their jobs by themselves, or this is done in an automatic manner, the local scheduling layer is not adapted to the Grid environment. Furthermore, it can be assumed that there is a general notion of trust between the partners within a Grid effort. That is, pooled resources are shared in a collaborative manner and the delegation or exchange of workload is a generally desired effect.

From a scheduling point of view, we identify two common topologies. In centralized systems, a central scheduling system, often called Grid or Meta Scheduler, accepts the submission of all workload within the Grid environment, independent of its origin.
It then decides on the distribution among the participating MPP sites that fit the users' requirements, and delegates the execution to the responsible MPP site scheduler, which in turn assigns the work to the local resources. This model is very common for major Grid middleware solutions [1, 2], and has been well-studied in several independent community-specific Grid projects [6]. An obvious specialization includes multiple levels, comprising a hierarchy of schedulers. There, schedulers on a higher level assign work to schedulers on a lower level, typically using mechanisms such as direct resubmission or decision negotiation.

The main advantage of this concept is the ability to incorporate different scheduling strategies and policies on the different levels, as well as a global view of the top-level Meta Scheduler regarding submitted workload. However, to allow a globally effective and coordinated job-resource-mapping, local sites have to delegate administrative control to the Grid infrastructure services; otherwise, only best-effort QoS can be guaranteed on the Grid level. Moreover, this approach inherently lacks scalability and, since it has a single point of failure, offers little or no fault tolerance.

In decentralized systems, local schedulers at each participating MPP site accept the submission of workload from a local user community. The incoming jobs are then either assigned to local resources or, by interacting with other schedulers in a direct or indirect fashion, migrated to other systems.
Current development, however, mainly focuses on the provision of an adequate infrastructure for Grids which enables peer-to-peer interaction of distributed resources [11]. The reason for this lies in one of the main characteristics of this concept: the absence of a global system view requires more sophisticated scheduling concepts, while admittedly providing better reliability due to the lack of a single failure point. Consequently, the benefit of peer-to-peer interaction of local schedulers has not been analyzed in an exhaustive way, although first promising results have been shown in a partly decentralized system model [9].

3 Examined Scenario

With respect to the basic parameters regarding organizational autonomy and equity, we assume a Computational Grid scenario with independent sites, see Figure 1, which integrate their available MPP systems into a federation of resources that may be used within a prescribed, project-driven community. As mentioned before, the owners do not cede the control over their resources to a Meta Scheduler, but employ their own batch management systems with a given queue configuration and scheduling heuristics. Furthermore, no participant exposes the characteristics of its job queue to other sites, and direct access to this information is forbidden.

Figure 1. Computational Grid scenario with independent, autonomous, and equitable sites in a federated environment.

Following, we detail our machine model, job model, and scheduling system setup, and introduce our migration strategy.

3.1 Machine Model

A single-site MPP system of site k is modeled by m_k identical parallel machines. This means a parallel job can be allocated on any subset of nodes of a machine. However, the local user submission behavior is adapted to the local site configuration. Thus, jobs which require a large set of processors might not be executable on all participating sites, as their requested number of resources may exceed the total number of available resources on certain sites.
3.2 Job Model

We assume rigid parallel batch jobs for our analysis, as this type is prevalent on most MPP systems. A job j is neither moldable nor malleable and requires concurrent and exclusive access to m_j ≤ m_k processing nodes, with m_k being the total number of nodes on the MPP system at site k. The number of required processing nodes m_j is available at release date r_j. The completion time of a job j on site k in schedule S_k is denoted by C_j(S_k). Further, we do not allow any preemption of jobs [3]. Our system model requires the users to provide an estimate p̄_j of the processing time, which is used to cancel jobs that exceed their estimate in order to protect the system from overuse by possibly faulty jobs.

3.3 Scheduling System

During operation in an MPP scheduling system, all submitted jobs are inserted into the waiting queue. The scheduling algorithm is then invoked on a regular basis and allocates waiting jobs onto the available machines. As jobs are submitted over time and neither the precise processing time p_j nor future job submissions are known in advance, scheduling of MPP systems is an online problem.

For simplicity, we do not allow any advanced queuing or partitioning scheme and use a single queue for all jobs. Regarding the scheduling algorithm, we assume the usage of one of the most popular algorithms, EASY Backfilling, due to its widespread use in existing batch management systems as well as its overall good performance [10, 7].

EASY Backfilling works as follows: first, the job at the head of the waiting queue is examined. If this job can be started immediately on the locally available nodes, it is removed from the waiting queue and executed directly. Otherwise, its earliest possible start time is assumed as a reservation, and for every other waiting job the following two conditions are tested: (1) it will terminate before the first job is expected to commence (based on the user's runtime estimation), and (2) it will not interfere with the nodes reserved for the first job.
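To make the heuristic concrete, the following is a minimal sketch of one scheduler invocation under the job model above. It is not the paper's implementation: the `Job` class and the `free_at_res` parameter (the number of nodes that remain idle even once the head job starts at its reservation, assumed to be computed by the caller from the running jobs' runtime estimates) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    nodes: int       # m_j: requested processing nodes
    estimate: float  # user-supplied runtime estimate

def easy_pass(queue, now_free, free_at_res, now, reservation):
    """One invocation of the EASY heuristic (simplified model).

    queue       -- waiting jobs, head first (mutated in place)
    now_free    -- nodes idle at time `now`
    free_at_res -- nodes still idle once the head job starts at
                   its reservation (caller-computed assumption)
    reservation -- earliest estimated start time of the head job
    Returns the list of jobs started at `now`.
    """
    if not queue:
        return []
    head = queue[0]
    if head.nodes <= now_free:
        # Head job fits on the free nodes: start it directly.
        return [queue.pop(0)]
    # Head job must wait; scan the rest of the queue for a backfill.
    for job in queue[1:]:
        if job.nodes > now_free:
            continue  # does not fit right now at all
        ends_in_time = now + job.estimate <= reservation  # condition (1)
        spares_nodes = job.nodes <= free_at_res           # condition (2)
        if ends_in_time or spares_nodes:
            queue.remove(job)
            return [job]  # first conforming candidate is backfilled
    return []
```

For example, with 4 idle nodes, a head job needing 8 nodes, and a reservation at time 100, a 2-node job with a 50-unit estimate is backfilled via condition (1), while a 4-node job estimated at 200 units is skipped.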
The first candidate that meets either condition is used as a backfill and started immediately. So far, the setup matches the typical, simple environment that can be found in most of the currently used real-world systems.

3.4 Grid Component

Following, we detail the Grid component of our scenario, introducing our middleware layer and migration strategy.

3.4.1 Middleware Layer

Regarding Grid-level interaction, we introduce an additional middleware layer in the given setup which provides exchange services between local sites while leaving the local scheduling system untouched. This layer is deployed on a per-site basis and mainly consists of a decision maker component that provides interfaces for job migrations on the one hand and job submissions on the other hand. Contrary to usual local batch systems, users now submit their jobs via the middleware layer and are not allowed to interact with the batch system directly. From the user's point of view, this intermediate service can be made completely transparent by exposing a standard batch system submission interface such as DRMAA or the qsub family of commands from POSIX.

The decision maker acts according to a certain policy that regulates when to request, offer, or accept remote jobs. As in real MPP installations, where remote system information is usually classified and therefore only partially available to the public, we consequently limit the amount of information available to the decision maker: only the current utilization and the locally waiting workload are known. While the former is typically exposed by most batch management systems and Grid middleware stacks [5], at least within the local system domain, the latter can be tracked by the decision maker component itself, since it acts as a bridge between the user and the local batch management system.

On this basis, the decision maker may offer subsets of jobs from the current waiting queue to other decision makers from other sites.
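The decision maker's two exchange interfaces can be sketched as follows. This is a hypothetical model, not the paper's middleware: the `DecisionMaker` class, `withdraw` method, and the `can_backfill` predicate (assumed to wrap the two EASY conditions of Section 3.3) are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    nodes: int  # m_j: requested processing nodes

class DecisionMaker:
    """Minimal stand-in for a site-local decision maker."""
    def __init__(self, queue, total_nodes):
        self.queue = queue              # locally waiting jobs
        self.total_nodes = total_nodes  # m_k of the local MPP system

    def offered_jobs(self, requester_nodes):
        # Distribution policy: expose all waiting jobs, prefiltered so
        # the requester meets each job's minimum resource requirement.
        return [j for j in self.queue if j.nodes <= requester_nodes]

    def withdraw(self, job):
        # A migrated job leaves the origin site's queue.
        self.queue.remove(job)

def pull_remote_jobs(local, remote_sites, can_backfill):
    """Acceptance policy: sequentially ask every known site for offers
    and enqueue each job that would be a valid EASY backfill locally."""
    pulled = []
    for site in remote_sites:
        for job in site.offered_jobs(local.total_nodes):
            if can_backfill(job):
                site.withdraw(job)
                local.queue.append(job)
                pulled.append(job)
    return pulled
```

For instance, a 16-node site polling a 128-node neighbour never even sees the neighbour's 64-node jobs, since the distribution policy prefilters them away.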
Note, however, that the selection of which jobs to expose to alien decision makers remains in the domain of the local site.

3.4.2 Migration Strategy

In the work at hand, we apply a very simple interchange strategy that can be used under the aforementioned information restrictions, see Figure 2. This strategy can be split up into two parts: an acceptance policy and a distribution policy.

Figure 2. Flow chart of the applied migration strategy.

The acceptance policy works as follows: jobs that are submitted from the local user community are invariably added to the end of the queue. Whenever the first job in the queue cannot be executed immediately (the checking interval, however, is up to the local scheduling system), it is checked whether any known remote site offers a job that could be backfilled with respect to the EASY heuristic. To this end, a request for available jobs is sent sequentially to all sites, and the returned set of answers is tested for conformance. All jobs that fulfill either criterion, see Section 3.3, are then added to the end of the local queue. This is repeated for every known remote site.

As mentioned before, the composition of the set that is returned on an offer request is completely at the remote site's discretion. In our scenario, this distribution policy applies no restrictions whatsoever on the selection, such that all waiting jobs are returned; only a prefiltering regarding the maximum number of available nodes on the requesting site is applied to ensure that offered jobs are generally runnable for the requester, that is, the requester meets the minimum resource requirements for the job.

4 Definition of Performance Objectives

In order to measure the schedule quality and to quantify the effect of job migrations, we define several objectives.
In general, we denote the set of jobs that have been submitted locally to site k by τ_k, and all jobs that have actually been processed on site k by π_k.

4.1 Squashed Area and Utilization

The first two objectives are Squashed Area SA_k and Utilization U_k, both specific to a certain site k. For a schedule S_k, they are measured from the earliest job start time C_min,k = min_{j ∈ π_k} {C_j(S_k) − p_j} up to the latest job completion time C_max,k = max_{j ∈ π_k} {C_j(S_k)}, the makespan.

SA_k denotes the overall resource usage of all jobs that have been executed on site k, see Equation 1:

    SA_k = Σ_{j ∈ π_k} p_j · m_j                                            (1)

U_k describes the ratio between total resource usage and available resources after the completion of all jobs j ∈ π_k, see Equation 2:

    U_k = SA_k / (m_k · (C_max,k − C_min,k))                                (2)

This is the usage efficiency of the site's available machines and therefore often serves as a schedule quality objective from the site provider's point of view.

In order to measure global utilization with respect to the whole Grid scenario K, we define the overall utilization U_o in Equation 3:

    U_o = (Σ_{k ∈ K} Σ_{j ∈ τ_k} p_j · m_j) / (Σ_{k ∈ K} m_k · (max_{k ∈ K} {C_max,k} − min_{k ∈ K} {C_min,k}))    (3)

This objective is computed for both the non-cooperative and the job sharing case, respectively. The changes in overall utilization indicate how effectively the connected resources can be used due to the interchange of jobs.

However, comparing single-site and multi-site utilization values is illicit: since the calculations of U_k and U_o depend on C_max,k, valid comparisons are only admissible if C_max,k is approximately equal between the single-site and multi-site scenario.
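Equations 1 and 2 can be computed directly from schedule data. The sketch below assumes each executed job is given as a plain (p_j, m_j, C_j) tuple of processing time, node count, and completion time; the function names are illustrative.

```python
def squashed_area(jobs):
    """SA_k (Equation 1): total resource usage of the jobs run at
    site k, given as (p_j, m_j, C_j) tuples."""
    return sum(p * m for p, m, c in jobs)

def utilization(jobs, m_k):
    """U_k (Equation 2): resource usage relative to what the m_k
    machines could deliver between C_min,k and C_max,k."""
    c_min = min(c - p for p, m, c in jobs)  # earliest job start time
    c_max = max(c for p, m, c in jobs)      # makespan
    return squashed_area(jobs) / (m_k * (c_max - c_min))
```

For example, two 4-node, 10-unit jobs finishing at times 10 and 20 on an 8-node site yield SA_k = 80 and U_k = 80 / (8 · 20) = 0.5. The caveat above applies to any value computed this way: U_k figures are only comparable across scenarios when the underlying makespans are roughly equal.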
Otherwise, high utilizations may indicate good usage efficiency, although the corresponding C_max,k value is very small, showing that only few jobs have been computed locally.

As such, we additionally introduce the Change of Squashed Area ΔSA_k, which provides a makespan-independent view on the utilization's alteration, see Equation 4:

    ΔSA_k = SA_k / (Σ_{j ∈ τ_k} p_j · m_j)                                  (4)

From the system provider's point of view, this objective reflects, compared to the local execution, the real change of utilization when jobs are shared between sites.

4.2 Average Weighted Response Time

For our third objective, we switch focus towards a more user-centric view and consider the Average Weighted Response Time AWRT_k relative to all jobs j ∈ τ_k that have been initially submitted to site k, see Equation 5. Note that this also respects the execution on remote sites and, as such, the completion time C_j(S) refers to the site that executed job j.

    AWRT_k = (Σ_{j ∈ τ_k} p_j · m_j · (C_j(S) − r_j)) / (Σ_{j ∈ τ_k} p_j · m_j)    (5)

A short AWRT describes that on average users do not wait long for their jobs to complete. Following Schwiegelshohn et al. [12], we use the resource consumption (p_j · m_j) of each job as weight. This ensures that neither splitting nor combination of jobs can influence the objective function in a beneficial way.

4.3 Migration Rate

Finally, we measure the amount of migration in the multi-site scenarios. To this end, we introduce the migration matrices M_n that show the ratio of job dissemination with respect to the original submission site, see Equation 6. There, rows denote the sites where jobs have been originally submitted to, while columns specify the actual execution sites.

    M_n = ( |π_11|/|τ_1|  ···  |π_1k|/|τ_1| )
          (      ⋮         ⋱        ⋮      )     with {l, k} ∈ [1 ··· |K|]       (6)
          ( |π_l1|/|τ_l|  ···  |π_lk|/|τ_l| )

Additionally, in order to measure the actual amount of workload that has been migrated between sites, we introduce the matrix M_SA, similar to M_n, which relates the overall SA per workload to the migrated portions. Exemplarily, the Squashed Area for the set π_kk is computed as SA_{π_kk} = Σ_{j ∈ π_kk} p_j · m_j.

5 Evaluation

In order to achieve realistic results from our simulations, we rely on exemplary workloads from the Parallel Workloads Archive. It provides job submission and execution traces recorded on real-world MPP system sites that contain information on relevant job characteristics. Although these workload traces were originally not recorded on sites participating in a Computational Grid, they include all obvious and hidden dependencies and correlations between job submissions and are, as such, to be preferred over synthetic data. All the more, the amount of available traces from sites participating in Grid environments is still very limited, see for example the Grid Workload Archive.

5.1 Input Data

For our evaluations, we restrict the set of used workloads to those which contain valid runtime estimates, since the reference algorithm depends on this data. Furthermore, we applied various pre-filtering steps to the original, partially erroneous, data so that we simulate only valid entries.

    Identifier  #Jobs   m_k   Months  Site Setup (I / II / III)
    KTH-11      28479    100  11      X X X
    CTC-11      77199    430  11      X X
    SDSC00-11   29810    128  11      X X
    SDSC03-11   65584   1152  11      X X
    SDSC05-11   74903   1664  11      X X

Table 1. Details on the shortened and filtered workloads obtained from the Parallel Workloads Archive.

The original workloads record time periods of different length.
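Before turning to the results, the remaining objectives of Section 4 can also be computed mechanically from trace data. The sketch below assumes jobs as (p_j, m_j, r_j, C_j) tuples and migration records as (submission_site, execution_site) pairs; both representations and the function names are illustrative, not the authors' tooling.

```python
from collections import Counter

def awrt(jobs):
    """AWRT_k (Equation 5): jobs are (p_j, m_j, r_j, C_j) tuples of
    processing time, nodes, release date, and completion time on the
    executing site; resource consumption p_j*m_j is the weight."""
    weight = sum(p * m for p, m, r, c in jobs)
    return sum(p * m * (c - r) for p, m, r, c in jobs) / weight

def migration_matrix(records, sites):
    """M_n (Equation 6): records holds one (submission_site,
    execution_site) pair per job; entry [l][k] is the fraction of
    site l's submissions that actually ran on site k."""
    tau = Counter(s for s, _ in records)  # |tau_l| per submission site
    pi = Counter(records)                 # |pi_lk| per site pair
    return [[pi[(l, k)] / tau[l] if tau[l] else 0.0 for k in sites]
            for l in sites]
```

For a two-site federation where site A ran half of its own submissions and migrated the other half to B, the A-row of M_n reads (0.5, 0.5); a diagonal-only matrix indicates that no migration took place.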
In order to be able to combine different workloads in multi-site simulations, we shortened the workloads to the minimum required length of all participating workloads. Details on the used traces are given in Table 1; the original input data files can be obtained from the authors' web site (˜lepping/workloads/).

Before presenting the detailed multi-site results, it is necessary to show the exclusive single-site results that will serve as a reference. Table 2 shows the above introduced performance metrics for the five examined workloads when the EASY Backfilling algorithm is applied and no job exchange is allowed.

    Identifier  AWRT      U      SA           C_max
    KTH-11      75157.63  68.72   2017737644  29363626
    CTC-11      52937.96  65.70   8279369703  29306682
    SDSC00-11   73605.86  73.94   2780016139  29374554
    SDSC03-11   50772.48  68.74  23337438904  29471588
    SDSC05-11   54953.84  60.17  29392365928  29357277

Table 2. AWRT (in seconds), U (in %), SA, and C_max (in seconds) for all workloads, EASY Backfilling, and exclusive single-site execution.

Here, it becomes already apparent that