A Path-Driven Loop Scheduling Mapped onto Generalized Hypercubes (PLEASE REFERENCE IN YOUR PAPERS)

Hai Jiang (Member, IEEE), A. T. Chronopoulos (Senior Member, IEEE)
Department of Computer Science
Wayne State University, University of Texas, San
atc@cs.utsa.edu

G. Papakonstantinou, P. Tsanakas
Dept. of Electrical and Computer Engineering
National Technical University of Athens
Athens, Greece

Abstract

One of the important issues in automatic code parallelization is the scheduling and mapping of nested loop iterations to different processors. The optimal scheduling problem is known to be NP-complete. Many heuristic static and dynamic loop scheduling techniques have been studied in the past. Here we propose a new static loop scheduling heuristic method called path-driven scheduling, under the assumption that the loop dependence graph has been generated. This method clusters tasks according to the directed paths on the dependence graph and assigns them to processors in the target architecture. We make comparisons with the free scheduling and the refined free scheduling algorithms [8]. We schedule three widely used nested loops on a generalized hypercube architecture. Our algorithm exhibits the lowest communication cost compared to the other two algorithms, while the execution cost is the same for all three algorithms.

Key Words: Loop scheduling, path-driven scheduling, path generation, path mapping, generalized hypercube.

1 Introduction

The problem of scheduling parallel program modules onto multiprocessor computers is known to be NP-complete in the general case [5]. Many heuristic scheduling algorithms have been proposed in the literature [3], [5], [6], [8], [9], [10], [11], [12], [13]. Some researchers introduced priority-based algorithms, such as list scheduling [6] and free scheduling [8]. Most of these are suitable for shared memory machines. When these algorithms are applied to distributed memory machines, performance degrades quickly because of the communication cost.
On distributed memory machines, a multi-stage method, clustering scheduling, is more practical [6]. It implements scheduling in two steps: task allocation and task ordering. First, task allocation clusters tasks according to dominant sequences on processors and makes task priority deterministic. Then task ordering can be achieved easily on the assigned processors.

The motivation behind this work is to find an efficient algorithm for scheduling loop tasks on a distributed memory system. Given a loop task graph, how tasks are clustered affects the overall performance of the scheduled program. Once the clustering is applied, we cannot change much in global scheduling. In clustering, just grouping high-communication tasks together is not sufficient for reducing parallel execution time. Task execution order and global parallel running time must still be taken into account.

Directed paths on the graph reflect the traces of data flow streams, and clusters are sets of nodes on a task graph. If we cluster tasks according to data flow (or directed paths), we combine the considerations of communication reduction and task execution ordering. Task allocation and ordering are then no longer distinct internally. Within such clusters, each node has exactly one immediate predecessor and one immediate successor. So task ordering is optimized, and scheduling follows naturally from the data flow.

In this paper we introduce a heuristic path-driven scheduling and mapping algorithm (P-D) that obtains optimized task clusters according to data flows and heuristically maps them onto the Processing Elements (PEs) of a target machine, the generalized hypercube [1], [4], [14]. The rest of the paper is organized as follows. The P-D scheduling algorithm is described in Section 2. In Section 3, we discuss the mapping onto the PEs of the generalized hypercube. Simulation results are given in Section 4. Finally, conclusions are presented in Section 5.
2 Path-Driven Scheduling

A path-driven algorithm is suitable for problems with invariant task computation cost. For such a nested loop problem, since the P-D has taken the links into consideration, the communication cost is not considered in the clustering phase but is dealt with in the mapping phase.

2.1 Directed Acyclic Graph model for nested loops

For a general program, we must detect the nested loops and generate a directed acyclic graph (DAG) from them. Our scheduling is based on the ensemble DAG of all loops.

Throughout this paper, we consider n-way nested loops with $l_j$ and $u_j$ as the lower and upper bounds of the $j$th loop. Without loss of generality, we assume $l_j$ and $u_j$ are integer-valued constants and $l_j \le u_j$ for all $1 \le j \le n$. The iteration space (or index set) is expressed by

$J^n = \{ (i_1, i_2, \ldots, i_n) \mid l_j \le i_j \le u_j, \text{ for } 1 \le j \le n \}$.

In the following sections, we only consider loop-carried dependences [5]. So we treat all statements at each iteration as a single element of $J^n$, and each iteration can be referenced by its index vector $\bar{i} = (i_1, i_2, \ldots, i_n)$.

For a data dependence, we use the following terminology and notation:

1. Dependence vector: If a variable x is defined or used at iteration $\bar{i}$, and redefined or used at iteration $\bar{j}$, then there is a dependence vector $\bar{d}$ between these iterations based on the variable x, where $\bar{d} = (d_1, d_2, \ldots, d_n)^t$, $d_k = j_k - i_k$, for $1 \le k \le n$. Using dependence vectors, we can denote all data dependences among any iterations in this iteration space. All three types of dependences (flow dependence, antidependence, and output dependence) are expressed in the same way.

2. Dependence matrix: $D = [\bar{d}_1, \bar{d}_2, \ldots, \bar{d}_m]$, for $m \in Z^+$, is the set of all dependence vectors in the iteration space.

When parallelizing a nested loop, all dependences have to be respected. In a flow dependence, variables are defined at an earlier iteration $\bar{i}$ and used at a later iteration $\bar{j}$, so data must be passed from $\bar{i}$ to $\bar{j}$. In an output dependence, variables are defined at $\bar{i}$ and redefined at $\bar{j}$. This only indicates an ordering relation between the two iterations: their execution order in the sequential program should be respected in the parallel schedule. In an antidependence, variables used at $\bar{i}$ as input are redefined at $\bar{j}$ ($\bar{i} < \bar{j}$). An antidependence is a kind of restriction. Iteration $\bar{i}$ uses some variables defined before $\bar{i}$ in the sequential program, and iteration $\bar{j}$ overwrites them. Thus, no matter how we schedule iteration $\bar{i}$, it has to be executed before iteration $\bar{j}$.

To detect an antidependence, we only need to check the dependence vector $\bar{d}$. Let $d_{i_0}$ be the first non-zero entry. If it is negative, i.e., $d_i = 0$ for $1 \le i < i_0$ and $d_{i_0} < 0$, then the dependence vector $\bar{d}$ indicates an antidependence. Since an antidependence relates an iteration to a later iteration that redefines some common variables, it has to be changed to indicate the proper execution order. One possible way is to convert it to a corresponding flow dependence. Then we can assume there exists a pseudo-data flow between the two iterations. To change the direction of the dependence vector $\bar{d}$, we replace it with a vector $\bar{d}' = (d'_1, d'_2, \ldots, d'_n)^t$, $d'_i = -d_i$, $1 \le i \le n$. From now on, we will not distinguish between flow, pseudo-flow, or output dependence.
We will just refer to them as dependences.

From the iteration space and the modified dependence matrix, we can generate a DAG $G(V, E)$ with vertices

$V = \{ v(i_1, i_2, \ldots, i_n) \mid \bar{i} = (i_1, i_2, \ldots, i_n),\ l_l \le i_l \le u_l, \text{ for } 1 \le l \le n \}$

and edges

$E = \{ e_{v(i_1, \ldots, i_n),\, v(j_1, \ldots, j_n)} \mid \bar{i} = (i_1, \ldots, i_n),\ \bar{j} = (j_1, \ldots, j_n), \text{ such that } \exists\, \bar{d} \in D \text{ with } \bar{j} = \bar{i} + \bar{d} \}$.

Each iteration is a vertex on the DAG, because we only consider loop-carried dependences. If a dependence exists between two iterations, there is an edge between the two corresponding vertices [8].

2.2 Generation of Scheduled Paths

The P-D schedules and maps tasks based on certain parameters. Let $u, v \in V$, and let $Z^+$ denote the set of positive integers. We define the following:

1. Predecessor Set: $pre(v) = \{ u \mid e_{u,v} \in E \}$

2. Successor Set: $suc(v) = \{ u \mid e_{v,u} \in E \}$

3. Earliest Schedule Level: $esl(v) = 1$ if $pre(v) = \emptyset$; otherwise $esl(v) = \max_{u \in pre(v)} (esl(u)) + 1$.

4. Graph Path: A set of vertices which are connected by a sequence of edges on the DAG.

5. Critical Path Length: $cpl(G(V, E)) = \max_{v \in V} (esl(v))$. Critical paths are the longest graph paths on the DAG. They dominate the execution time. The critical path length is at least the same as the loop parallel time.

6. Latest Schedule Level: $lsl(v) = cpl(G(V, E))$ if $suc(v) = \emptyset$; otherwise $lsl(v) = \min_{u \in suc(v)} (lsl(u)) - 1$.

7. Task Priority Value: Each task is assigned a priority value $\pi$ for selection in clustering. Smaller values indicate higher priority. For $v \in V$, $\pi(v) = lsl(v) - esl(v)$.

8. Task Level Set: Tasks are classified by levels according to their Earliest Schedule Level. All tasks which could be executed in parallel at time slot $i \in Z^+$ are scheduled at $tls_i$, where $tls_i = \{ v \mid esl(v) = i,\ v \in V,\ i \in Z^+ \}$.

9. Scheduled Path (sp): Each $sp$ is a set of vertices selected by the P-D. Each vertex of the DAG belongs to only one $sp$. Critical Scheduled Paths ($csp$) are the longest $sp$'s. Tasks of each $sp$ are mapped to a particular PE for execution.

10. Path Communication Set (pcs): Each scheduled path communicates with others through this set of edges. For $sp_i$, $i \in Z^+$,
$pcs_i = \{ e_{u,v} \mid e_{u,v} \in E,\ (u \in sp_i \text{ and } v \in V - sp_i) \text{ or } (v \in sp_i \text{ and } u \in V - sp_i) \}$.

11. Path Exchange Set: Two scheduled paths might need to exchange data through a set of edges just between them. For $sp_i, sp_j$, $i, j \in Z^+$,
$pes_{i,j} = \{ e_{u,v} \mid e_{u,v} \in E,\ (u \in sp_i \text{ and } v \in sp_j) \text{ or } (v \in sp_i \text{ and } u \in sp_j) \}$.

For a given DAG, the P-D assigns tasks to different PEs and sets up the execution order. Generally, a task scheduling algorithm consists of three steps [6]:

1. Partition the DAG into a set of distinct task clusters.
2. Reorder the tasks inside their clusters.
3. Map the clusters to PEs while maintaining communication efficiency.

Scheduled paths in the P-D are similar to task clusters. The difference is that tasks on scheduled paths must have data dependence relations, which is not necessarily true for the tasks in clusters. Scheduled paths are generated according to data dependences. Initially a scheduled path is a trace of data flow: a data stream passes through it from start to end. At the same time, the task order is fixed on the $sp$ because of the dependences. During the mapping, the costs of all communication links on the same scheduled path are zeroed.
So internally (on each PE) the three separate scheduling steps are collapsed into one step.

Given a DAG, the length of the critical scheduled paths is constant regardless of how they are generated. Longer scheduled paths contain more tasks, and possibly more communication links, since they are generated by data dependences. In general, if fewer PEs are used and the scheduled paths are longer, the expected efficiency of the execution will be higher.

Scheduled paths are created to cover all the tasks in a DAG. Some tasks are shared by several graph paths, since a DAG contains joint nodes (e.g., fork and join points). These shared tasks can belong to only one scheduled path. Once a scheduled path is extracted, its task vertices are not isolated and cut off from the DAG as in linear clustering methods [10]. They are sharable and stay on the DAG so that data flows can pass through them for the detection of other scheduled paths. For example, if a parent task is unprocessed and its immediate children have already been assigned to some scheduled paths by their other parents, these selected children can be used as pseudo-tasks on the new scheduled path. This lets the parent find its unprocessed grandchildren and place them on the same new scheduled path. Both the parent and its grandchildren have data links with the tasks in between (the parent's children). Putting them on the same scheduled path benefits the later mapping by reducing communication distance. So the P-D can generate longer scheduled paths naturally and reduce the potential difficulty of merging them afterwards.

One could use heuristic methods to get a suboptimal scheduled path. For the safest solution, we should enumerate all possible graph paths, then pick out the scheduled paths in decreasing order of their lengths. But this method is not practical: both time and space costs are as high as in task duplication scheduling. We propose a heuristic strategy of scheduled path generation as follows:

1. Determine the earliest schedule levels: Traverse the DAG from the top down to determine the earliest schedule levels ($esl$) of all tasks. If a task is independent, its $esl = 1$. If a task only depends on tasks with $esl = 1$, its $esl = 2$, and so on. If a task depends on some tasks with $esl \le i$ (i.e., at least one of them has $esl = i$), this task's $esl = i + 1$.

2. Determine the latest schedule levels: The tasks with the biggest $esl$ are the exit nodes on critical scheduled paths. For them, $lsl(\text{task}) = esl(\text{task})$. From the bottom up, we traverse the DAG again to determine the $lsl$ for all tasks. For a task whose successors have $lsl \ge i$ (at least one of them has $lsl = i$), this task's $lsl = i - 1$.

3. Group tasks: Tasks are grouped and placed into different task level sets ($tls$) according to $esl$, i.e., if a task's $esl = i$, it is placed in $tls_i$. Tasks which have the same values of $esl$ and $lsl$ are called Critical Path Tasks ($cpt$). Each $tls_i$ contains at least one $cpt$.

4. Generate scheduled paths: All critical scheduled paths are identified before the non-critical scheduled paths. There could be several critical scheduled paths. First, we scan the task level set table from the top down (i.e., from $tls_1$ to $tls_{cpl}$) to select the start of a scheduled path, then identify its other tasks from the data dependences. If $tls_i$ is not empty, we choose one critical path task at random. If none is available, we choose one with the smallest priority value $\pi$ (which implies it belongs to a longer scheduled path), and mark it. Then we check the successors of the selected task and select an unmarked task with the smallest $\pi$. If none is available, we choose one as a pseudo-task, and check its successors, until the exit tasks are reached. We mark the whole scheduled path, and then restart to generate another one until all tasks have been marked.
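The four steps above can be sketched in Python as follows. This is only an illustrative reading of the strategy, not the authors' implementation: the function names `schedule_levels` and `scheduled_paths` are hypothetical, the priority value is taken as lsl minus esl, and a pseudo-task is picked deterministically (the first marked successor) instead of at random so that the result is reproducible.

```python
from collections import defaultdict

def schedule_levels(vertices, edges):
    """Steps 1-2: earliest/latest schedule levels of a DAG given as (u, v) edges."""
    pre, suc = defaultdict(list), defaultdict(list)
    for u, v in edges:
        suc[u].append(v)
        pre[v].append(u)
    # Topological order (Kahn's method); assumes the graph is acyclic.
    indeg = {v: len(pre[v]) for v in vertices}
    order = [v for v in vertices if indeg[v] == 0]
    for u in order:                      # the list grows while we walk it
        for w in suc[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                order.append(w)
    esl = {}
    for v in order:                      # top-down pass
        esl[v] = 1 + max((esl[u] for u in pre[v]), default=0)
    cpl = max(esl.values())
    lsl = {}
    for v in reversed(order):            # bottom-up pass
        lsl[v] = min((lsl[u] for u in suc[v]), default=cpl + 1) - 1
    return esl, lsl, cpl, suc

def scheduled_paths(vertices, edges):
    """Steps 3-4: scan the level sets top-down and grow one path per unmarked start."""
    esl, lsl, cpl, suc = schedule_levels(vertices, edges)
    prio = {v: lsl[v] - esl[v] for v in vertices}   # 0 marks a critical path task
    marked, paths = set(), []
    for level in range(1, cpl + 1):
        tls = sorted((v for v in vertices if esl[v] == level), key=prio.get)
        for start in tls:
            if start in marked:
                continue
            path, v = [start], start
            marked.add(start)
            while suc[v]:                # walk until an exit task is reached
                free = [x for x in suc[v] if x not in marked]
                if free:
                    v = min(free, key=prio.get)
                    path.append(v)
                    marked.add(v)
                else:
                    v = suc[v][0]        # pseudo-task: pass through, do not add
            paths.append(path)
    return paths, esl

edges = [("a", "b"), ("c", "b"), ("b", "d"), ("b", "e")]
paths, esl = scheduled_paths(list("acbde"), edges)
print(paths)   # [['a', 'b', 'd'], ['c', 'e']]
```

In this small example the second path starts at `c`, finds its only child `b` already taken, uses `b` as a pseudo-task, and picks up the unprocessed grandchild `e`. The returned `esl` values can serve as the time slots of the tasks on each path.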
In this strategy, each scheduled path contains at least one task. There might be some redundant pseudo-tasks. They help to detect more tasks along the same data streams. Once a scheduled path is created, its tasks can be eliminated from the scheduling record, but not from the DAG.

Algorithm (Scheduled Path Generation)

Input: DAG $G(V, E)$ with $tls(i)$, $1 \le i \le cpl$, and $\pi(j)$, $1 \le j \le |V|$
Output: List $sp_1, sp_2, \ldots, sp_m$, for some $m \ge 1$

  m = 0
  for i = 1 to cpl
    while not all tasks in tls_i are marked "selected" do
      m = m + 1; sp_m = ∅
      select v such that π(v) = min{ π(u), ∀ u ∈ tls_i and u is unmarked }
      sp_m = sp_m ∪ {v}, and mark v "selected"
      j = i
      while j < cpl do
        if ∃ u ∈ suc(v) and u is unmarked
          then select unmarked x, π(x) = min{ π(y), ∀ unmarked y ∈ suc(v) }
               sp_m = sp_m ∪ {x}
               mark x "selected"
          else select a marked x ∈ suc(v) at random
        endif
        v = x
        j = j + 1
      endwhile
    endwhile
  endfor

Task time-stamping: Once we have all scheduled paths, we know which $sp$ each task belongs to and the tasks' sequential order. But we still need to assign each task a time-stamp to indicate at which step it can run on a particular PE. From the length of the $csp$'s, we know the number of parallel execution steps. Along each $sp$ there exists the same number of execution time slots. Each time slot can contain only one task. Task time-stamping places each task into a suitable time slot on its $sp$ according to the dependences in the DAG. The $esl$'s generated during the scheduled path generation are used for the time-stamping to fill the time slots.

Scheduled path optimization: Up to now, an $sp$ is a trace of data flow.
If it is not contiguous, it must be sharing some tasks with others, and the shared tasks are not on the current $sp$. It is costly to let all $sp$'s always find available tasks beyond the shared ones during scheduled path generation, so several broken, shorter scheduled paths may be created.

If each task on $sp_i$ is scheduled at an earlier $esl$ than all those on $sp_j$, these two scheduled paths are good candidates for merging. If the last task on $sp_i$ is connected to the first task on $sp_j$ through some other vertices in the DAG, we call them related sp's. The benefits of merging them are:

• To reduce the number of PEs for higher efficiency.
• To make mapping easier for a non-fully connected topology which might not be rich in communication channels. Merging scheduled paths can reduce such requests because the shared tasks need to exchange data with other tasks on both original shorter $sp$'s.

In general, any two scheduled paths whose tasks are on different $lsl$, i.e., with different $esl$, can be merged together. But this does not guarantee that they are related sp's. If they are non-related sp's, merging cannot always reduce the communication cost, because the cost depends on the underlying non-fully connected topology. This problem needs further investigation.

3 Mapping on Target Machine

In theory, all $sp$'s can be mapped onto any PEs of a target physical machine. In a shared memory machine model, the underlying PE network topology can be thought of as fully connected. Each PE is similar to all others regarding communication links and locations. In this case, mapping is easy: any scheduled path can be mapped onto any PE.

For a distributed memory machine, communication latency plays a big role. Different mappings might result in different communication costs and lengthen the total parallel execution time. The heuristic mapping tries to map scheduled paths to nearby PEs if they share more communication links.
3.1 Detection of Scheduled Path Relationship

The communication frequency of $sp_i$ with all other $sp$'s is denoted by $|pcs_i|$. This indicates the number of links shared between $sp_i$ and all others. An $sp$ with bigger $|pcs_i|$ should be mapped earlier than others. This makes the mapping of subsequent $sp$'s easier.

For $sp_i$ and $sp_j$, the communication frequency between them is denoted by $|pes_{i,j}|$. $sp$'s with bigger $|pes_{i,j}|$ should be mapped to nearby PEs in order to reduce the communication cost between them.

During mapping, these communication frequencies are used to determine the mapping order and location.

3.2 Generalized Hypercube

The Generalized Hypercube topology architecture has been studied in [1], [14]. Due to its high degree of node connectivity, it offers a viable alternative to the shared memory and other distributed memory architectures. Many past and current massively parallel computers are based on meshes or k-ary n-cubes (e.g., Cray T3E, Intel Paragon, Tera, and the design in [2]). Unlike the mesh or k-ary n-cubes, the generalized hypercubes ($GH(n, k)$, where n = number of dimensions and k = number of nodes in each dimension) have $k$ fully interconnected nodes in each dimension. As a result they have a very low diameter and a very high bisection width. However, the number of interconnection links in each dimension increases linearly with the number of nodes $k$ [1].

[Figure 1: The 2-D generalized hypercube $GH(2, 4)$, with nodes labeled $(0,0)$ through $(3,3)$]

GH is a symmetric topology. All PEs have the same number and structure of links to the others. Figure 1 illustrates the 2-D $GH(n, k)$ with $n = 2$ and $k = 4$. The diameter of a 2-D $GH(n, k)$ is only 2 and the bisection width is $k^3/4$.
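The diameter and bisection figures quoted above can be checked on a small instance. The sketch below is an illustration under stated assumptions, not part of the paper: the helper names (`gh_nodes`, `gh_hops`, `gh_diameter`, `gh_cut_width`) are hypothetical, nodes are labeled by coordinate tuples as in Figure 1, and two nodes are taken to be linked exactly when their labels differ in one coordinate. The cut that is counted splits the columns into two halves; it is one balanced cut that yields $k^3/4$ links, not a proof that no smaller bisection exists.

```python
from itertools import product

def gh_nodes(n, k):
    """All node labels of GH(n, k): n coordinates, each in 0..k-1."""
    return list(product(range(k), repeat=n))

def gh_hops(a, b):
    """Hop distance: number of coordinates in which the two labels differ."""
    return sum(x != y for x, y in zip(a, b))

def gh_diameter(n, k):
    """Largest hop distance over all node pairs (brute force)."""
    nodes = gh_nodes(n, k)
    return max(gh_hops(a, b) for a in nodes for b in nodes)

def gh_cut_width(k):
    """Links crossing the cut that splits GH(2, k) into column halves (k even)."""
    nodes = gh_nodes(2, k)
    left = {p for p in nodes if p[1] < k // 2}
    return sum(1 for a in nodes for b in nodes
               if a < b and gh_hops(a, b) == 1 and ((a in left) != (b in left)))

print(gh_diameter(2, 4))   # 2
print(gh_cut_width(4))     # 16, i.e. 4**3 / 4
```

With this labeling, two PEs in the same row or column are one hop apart and all other pairs are two hops apart, which is the cost model used for the mapping in the next subsection.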
3.3 Mapping on GH

We choose the 2-D generalized hypercube ($GH(2, k)$) as a network topology model to show how to map scheduled paths to a physical machine. Between two PEs on the same dimension of $GH(2, k)$, each communication takes a single time unit. Otherwise, it takes twice this time.

We distinguish two cases: (a) $n_p$ (number of paths) $\le p$ (number of PEs) and (b) $n_p > p$. In (b), several scheduled paths are mapped onto each PE, which zeroes all communication links between them, but the parallel time will be longer. In such cases, the mapping is split into global and local mappings. Each PE has several scheduled path slots. Initially, the global mapping is applied to map a scheduled path onto a PE by selecting the one with the biggest $|pcs_i|$. Then, for the local mapping, $|pes_{i,j}|$ is used to find an unselected scheduled path which has the most communication links with those already mapped on the current PE, and the new one is mapped to a spare slot. This local mapping process continues until the local slots are full or the locally mapped scheduled paths have no communication need with any others.

Because of the peculiarity of $GH(2, k)$, the global mapping strategy can be formulated as the following algorithm, whereas the local mapping remains the same. To illustrate the global mapping algorithm, some terminology for $GH(2, k)$ is introduced.

• Link Counters for Rows (or Columns): For each newly selected $sp_i$, each GH row (or column) maintains a parameter $lcr_i[j]$ (or $lcc_i[j]$) for the total number of links between $sp_i$ and all scheduled paths which have been assigned to this row (or column). For each GH row (or column) $j$,

$lcr_i[j] = \sum_{\forall sp_l \text{ mapped on row } j} |pes_{i,l}|$

$lcc_i[j] = \sum_{\forall sp_l \text{ mapped on column } j} |pes_{i,l}|$

We next present the global mapping strategy on the 2-D GH.

1. Sort the $sp$'s in decreasing order of $|pcs_i|$.
2. Select an unselected $sp_i$ with the biggest $|pcs_i|$.
3. Sort the GH rows and columns according to $lcr_i$ and $lcc_i$.
4. Find an available GH node $(m_0, n_0)$ such that $lcr_i[m_0] + lcc_i[n_0] = \max \{ lcr_i[m] + lcc_i[n], \text{ for any GH node } (m, n) \}$.
5. Map the selected $sp_i$ to the GH node $(m_0, n_0)$.
6. Activate the local mapping process to map more $sp$'s to available local slots.
7. Repeat steps 2 to 6 until all $sp$'s have been mapped.

4 Simulated Experiments

In this paper, the parallel computation time is determined by the critical path length unless the target machine runs out of PEs. The performance of the algorithms is evaluated by the number of inter-communication links ($nicl$) among tasks on PEs. For a fully connected topology, $nicl$ is just the number of links among all scheduled paths. For a real target machine, $nicl$ is the sum of the communication costs among $sp$'s on PEs, expressed as the total number of hops.

For the 2-D GH, the PE nodes are fully connected along each dimension. Any communication along the same dimension has weight 1. Any other communication, between two scheduled paths in different rows and different columns, has weight 2. The sum of these weights indicates the total communication cost.

Next, free scheduling, refined free scheduling, and path-driven scheduling are applied to a simple nested loop example, the matrix multiplication loops, and the Jacobi loops. The comparison of the $nicl$'s is reported for the fully connected topology and the 2-D GH.
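The global mapping steps and the $nicl$ cost on a 2-D GH can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes case (a), with at most one scheduled path per PE, so the local mapping step is skipped. The names `exchange_counts`, `global_mapping`, and `nicl` are hypothetical, and ties between equally good nodes are broken by taking the first free node in row-major order.

```python
from collections import Counter
from itertools import product

def exchange_counts(paths, edges):
    """|pes_{i,j}| for every pair of paths: number of DAG edges between them."""
    owner = {v: i for i, p in enumerate(paths) for v in p}
    pes = Counter()
    for u, v in edges:
        i, j = owner[u], owner[v]
        if i != j:
            pes[frozenset((i, j))] += 1
    return pes

def global_mapping(paths, edges, k):
    """Greedy placement of paths on GH(2, k); assumes len(paths) <= k * k."""
    pes = exchange_counts(paths, edges)
    n = len(paths)
    link = lambda i, j: pes[frozenset((i, j))]      # a missing pair counts as 0
    pcs = [sum(link(i, j) for j in range(n) if j != i) for i in range(n)]
    place = {}
    for i in sorted(range(n), key=lambda i: -pcs[i]):   # biggest |pcs_i| first
        lcr, lcc = [0] * k, [0] * k                     # row / column link counters
        for l, (r, c) in place.items():
            lcr[r] += link(i, l)
            lcc[c] += link(i, l)
        free = [p for p in product(range(k), repeat=2) if p not in place.values()]
        place[i] = max(free, key=lambda p: lcr[p[0]] + lcc[p[1]])
    return place

def nicl(place, paths, edges):
    """Total hops: weight 1 within a row or column, 2 otherwise, 0 on the same PE."""
    owner = {v: i for i, p in enumerate(paths) for v in p}
    total = 0
    for u, v in edges:
        a, b = place[owner[u]], place[owner[v]]
        total += (a[0] != b[0]) + (a[1] != b[1])
    return total

paths = [["a", "b", "d"], ["c", "e"]]
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("b", "e")]
place = global_mapping(paths, edges, k=2)
print(place)                      # {0: (0, 0), 1: (0, 1)}
print(nicl(place, paths, edges))  # 2
```

In the example the two paths exchange data over two edges, so the second path is placed in the same row as the first and each of the two edges costs a single hop.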