A Path-Driven Loop Scheduling Mapped onto Generalized Hypercubes
Hai Jiang (Member, IEEE), Department of Computer Science, Wayne State University, haj@cs.wayne.edu
A. T. Chronopoulos (Senior Member, IEEE), University of Texas, San Antonio, atc@cs.utsa.edu
G. Papakonstantinou, P. Tsanakas, Dept. of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
Abstract
One of the important issues in automatic code parallelization is the scheduling and mapping of nested loop iterations to different processors. The optimal scheduling problem is known to be NP-complete. Many heuristic static and dynamic loop scheduling techniques have been studied in the past. Here we propose a new static loop scheduling heuristic called path-driven scheduling, under the assumption that the loop dependence graph has been generated. This method clusters tasks according to the directed paths on the dependence graph and assigns them to processors in the target architecture. We make comparisons with the free scheduling and the refined free scheduling algorithms [8]. We schedule three widely used nested loops on a generalized hypercube architecture. Our algorithm exhibits the lowest communication cost compared to the other two algorithms, while the execution cost is the same for all three algorithms.
Key Words:
Loop scheduling, path-driven scheduling, path generation, path mapping, generalized hypercube.
1 Introduction
The problem of scheduling parallel program modules onto multiprocessor computers is known to be NP-complete in general cases [5]. Many heuristic scheduling algorithms have been proposed in the literature [3], [5], [6], [8], [9], [10], [11], [12], [13]. Some researchers introduced priority-based algorithms, such as list scheduling [6] and free scheduling [8]. Most of these are suitable for shared memory machines. When these algorithms are applied to distributed memory machines, performance degrades quickly because of the communication cost. On distributed memory machines, a multi-stage method, clustering scheduling, is more practical [6]. It implements scheduling in two steps: task allocation and task ordering. First, task allocation clusters tasks according to dominant sequences on processors and makes task priorities deterministic. Then task ordering can be achieved easily on the assigned processors.

The motivation behind this work is to find an efficient algorithm to schedule loop tasks on distributed memory systems. Given a loop task graph, how tasks are clustered affects the whole performance of the scheduled program, and once the clustering is applied, we cannot change much in global scheduling. In clustering, just grouping high-communication tasks together is not sufficient for reducing parallel execution time; task execution order and global parallel running time must still be taken into account.

Directed paths on the graph reflect the traces of data-flowing streams, and clusters are sets of nodes on a task graph. If we cluster tasks according to data flow (directed paths), we can combine the goals of communication reduction and task execution ordering. This makes task allocation and ordering no longer internally distinct. Within such clusters, each node has exactly one immediate predecessor and one immediate successor, so task ordering is optimized and scheduling follows naturally from the data flow.

We introduce a heuristic path-driven scheduling and mapping algorithm (P-D) in this paper to obtain optimized task clusters according to data flows and map them heuristically onto the Processing Elements (PEs) of a target machine, the generalized hypercube [1], [4], [14]. The rest of the paper is organized as follows. The P-D scheduling algorithm is described in Section 2. In Section 3, we discuss the mapping on the PEs of the generalized hypercube. Simulation results are given in Section 4. Finally, conclusions are presented in Section 5.
2 Path-Driven Scheduling
A path-driven algorithm is suitable for problems with invariant task computation cost. For such a nested loop problem, since the P-D has taken the links into consideration, the communication cost is not considered in the clustering phase but is dealt with in the mapping phase.
2.1 Directed Acyclic Graph Model for Nested Loops
For a general program, we must detect the nested loops and generate a directed acyclic graph (DAG) from them. Our scheduling will be based on the ensemble DAG of all loops.

Throughout this paper, we consider n-way nested loops with l_j and u_j as the lower and upper bounds of the jth loop. Without loss of generality, we assume l_j and u_j are integer-valued constants and l_j ≤ u_j for all 1 ≤ j ≤ n. The iteration space (or index set) is expressed by J^n = { (i_1, i_2, ..., i_n) | l_j ≤ i_j ≤ u_j, for 1 ≤ j ≤ n }. In the following sections, we only consider loop-carried dependences [5]. So we treat all statements at each iteration as a single element of J^n, and each iteration can be referenced by its index vector i = (i_1, i_2, ..., i_n).

For a data dependence, we use the following terminology and notation:

1. Dependence vector: If a variable x is defined or used at iteration i, and redefined or used at iteration j, then there is a dependence vector d between these iterations based on the variable x, where d = (d_1, d_2, ..., d_n)^t, d_k = j_k − i_k, for 1 ≤ k ≤ n. Using dependence vectors, we can denote all data dependences among any iterations in this iteration space. All three types of dependences, flow dependence, antidependence and output dependence, will be expressed in the same way.

2. Dependence matrix: D = [d_1, d_2, ..., d_m], for m ∈ Z+, is the set of all dependence vectors in the iteration space.

When parallelizing a nested loop, all dependences have to be respected. In a flow dependence, variables are defined at an earlier iteration i and used at a later iteration j, so data must be passed from i to j. In an output dependence, variables are defined at i and redefined at j. This just indicates a clear relation between the two iterations: their execution order in a sequential program should be respected in the parallel program scheduling. In an antidependence, variables used at i as input are redefined at j (i < j). An antidependence is actually a kind of restriction: iteration i uses some variables defined before i in the sequential program, and iteration j overwrites them. Thus no matter how we schedule iteration i, it has to be executed before iteration j.

To detect an antidependence, we just need to check the dependence vector d. Let d_{i_0} be the first non-zero entry and negative, i.e., d_i = 0 for 1 ≤ i < i_0 and d_{i_0} < 0; then the dependence vector d indicates an antidependence. Since an antidependence relates an iteration to a later iteration that redefines some common variables, it has to be changed to indicate the proper execution order. One possible way is to convert it to a corresponding flow dependence; then we can assume there exists a pseudo-data flow between the two iterations. To change the direction of dependence vector d, we replace it with a vector d' = (d'_1, d'_2, ..., d'_n)^t, d'_i = −d_i, 1 ≤ i ≤ n. From now on, we will not distinguish between flow, pseudo-flow or output dependences; we will just refer to them as dependences.

From the iteration space and the modified dependence matrix, we can generate a DAG G(V,E) with:

Vertices V = { v(i_1, i_2, ..., i_n) | i = (i_1, i_2, ..., i_n), l_l ≤ i_l ≤ u_l, for 1 ≤ l ≤ n }

and edges E = { e_{v(i_1, ..., i_n), v(j_1, ..., j_n)} | i = (i_1, ..., i_n), j = (j_1, ..., j_n), such that ∃ d ∈ D with j = i + d }.

Each iteration is a vertex on the DAG, because we only consider loop-carried dependences. If a dependence exists between two iterations, there is an edge between the two corresponding vertices [8].
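As a concrete illustration of this construction, here is a minimal Python sketch (the names build_dag and normalize are ours, not from the paper) that flips each antidependence into a pseudo-flow dependence by negating the vector, and then adds an edge i → j whenever j = i + d falls inside the iteration space:

```python
from itertools import product

def normalize(d):
    """Flip an antidependence (first non-zero entry negative) into a
    pseudo-flow dependence by negating the whole vector (Section 2.1)."""
    for entry in d:
        if entry != 0:
            return d if entry > 0 else tuple(-x for x in d)
    return d

def build_dag(lower, upper, deps):
    """Vertices are iteration index vectors; there is an edge i -> j
    whenever j = i + d for some dependence vector d in D."""
    ranges = [range(l, u + 1) for l, u in zip(lower, upper)]
    V = set(product(*ranges))
    D = [normalize(d) for d in deps]
    E = set()
    for i in V:
        for d in D:
            j = tuple(a + b for a, b in zip(i, d))
            if j in V:           # keep only edges inside the iteration space
                E.add((i, j))
    return V, E

# 2-way nested loop, 3x3 iterations; (0,-1) is an antidependence
V, E = build_dag((1, 1), (3, 3), [(1, 0), (0, -1)])
```

Here the antidependence (0, −1) is converted to the pseudo-flow dependence (0, 1) before any edges are generated, so the resulting graph is acyclic.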
2.2 Generation of Scheduled Paths
The P-D schedules and maps tasks based on certain parameters. Let u, v ∈ V, and let the set of positive integers be denoted by Z+. We define the following:

1. Predecessor set: pre(v) = { u | e_{u,v} ∈ E }.

2. Successor set: suc(v) = { u | e_{v,u} ∈ E }.

3. Earliest schedule level: esl(v) = 1 if pre(v) = ∅; otherwise esl(v) = max_{u ∈ pre(v)} (esl(u)) + 1.

4. Graph path: a set of vertices which are connected by a sequence of edges on the DAG.

5. Critical path length: cpl(G(V,E)) = max_{v ∈ V} (esl(v)). Critical paths are the longest graph paths on the DAG. They dominate the execution time. The critical path length is at least the same as the loop parallel time.

6. Latest schedule level: lsl(v) = cpl(G(V,E)) if suc(v) = ∅; otherwise lsl(v) = min_{u ∈ suc(v)} (lsl(u)) − 1.

7. Task priority value: each task is assigned a priority value π for selection in clustering. Smaller values indicate higher priority. For v ∈ V, π(v) = lsl(v) − esl(v).

8. Task level set: tasks are classified by levels according to their earliest schedule level. All tasks which could be executed in parallel at time slot i ∈ Z+ will be scheduled at tls_i, where tls_i = { v | esl(v) = i, v ∈ V, i ∈ Z+ }.

9. Scheduled path (sp): each sp is a set of vertices selected by the P-D. Each vertex of the DAG belongs to only one sp. Critical scheduled paths (csp) are the longest sp's. The tasks of each sp will be mapped to a particular PE for execution.

10. Path communication set (pcs): each scheduled path communicates with the others through this set of edges. For sp_i, i ∈ Z+, pcs_i = { e_{u,v} | e_{u,v} ∈ E, (u ∈ sp_i and v ∈ V − sp_i) or (v ∈ sp_i and u ∈ V − sp_i) }.

11. Path exchange set (pes): two scheduled paths might need to exchange data through a set of edges just between them. For sp_i, sp_j, i, j ∈ Z+, pes_{i,j} = { e_{u,v} | e_{u,v} ∈ E, (u ∈ sp_i and v ∈ sp_j) or (v ∈ sp_i and u ∈ sp_j) }.
For a given DAG, the P-D assigns tasks to different PEs and sets up the execution order. Generally, a task scheduling algorithm consists of three steps [6]:

1. Partition the DAG into a set of distinct task clusters.
2. Reorder the tasks inside their clusters.
3. Map the clusters to PEs maintaining communication efficiency.

Scheduled paths in the P-D are similar to task clusters. The difference is that tasks on scheduled paths must have data dependence relations, which is not necessarily true for the tasks in clusters. Scheduled paths are generated according to data dependences. Initially a scheduled path is a trace of data flow: the data stream passes through it from start to end, and at the same time the task order has been fixed on the sp by the dependences. During the mapping, the costs of all communication links on the same scheduled path are zeroed. So internally (on each PE) the three separate scheduling steps are collapsed into one step.

Given a DAG, the length of the critical scheduled paths will be constant regardless of how they might be generated. Longer scheduled paths contain more tasks, and possibly more communication links, since they are generated by data dependences. In general, if fewer PEs are used and the scheduled paths are longer, the expected efficiency of the execution will be higher.

Scheduled paths are created to cover all the tasks in a DAG. Some tasks are shared by several graph paths, as a DAG has some joint nodes (e.g., fork and join points). These shared tasks can only belong to one scheduled path. Once a scheduled path is extracted, its task vertices are not isolated and cut off from the DAG as in linear clustering methods [10]. They are sharable and stay on the DAG to let data flows pass through for the detection of other scheduled paths. For example, if a parent task is unprocessed, and its immediate children have been assigned to some scheduled paths by their other parents, these selected children can be used as pseudo-tasks on the new scheduled path, letting the parent find its unprocessed grand-children and place them on the same new scheduled path. Both the parent and its grand-children have data links with the tasks in between (the parent's children). Putting them on the same scheduled path benefits the future mapping by reducing communication distance. So the P-D can generate longer scheduled paths naturally and reduce the potential difficulty in merging them afterwards.

One could use heuristic methods to obtain a suboptimal scheduled path. The safest solution would be to enumerate all possible graph paths and then pick out the scheduled paths in decreasing order of their lengths, but this method is not practical: both its time and space costs are as high as those of task duplication scheduling. We propose a heuristic strategy of scheduled path generation as follows:

1. Determine the earliest schedule levels: Traverse the DAG from the top down to determine the earliest schedule levels (esl) of all tasks. If a task is independent, its esl = 1. If a task only depends on tasks with esl = 1, its esl = 2, and so on. If a task depends on some tasks with esl ≤ i (i.e., at least one task has esl = i), this task's esl = i + 1.

2. Determine the latest schedule levels: The tasks with the biggest esl are actually the exit nodes on critical scheduled paths; for these, lsl(task) = esl(task). From the bottom up, we traverse the DAG again to determine the lsl of all tasks. For a task whose successors have lsl ≥ i (with at least one successor having lsl = i), this task's lsl = i − 1.

3. Group tasks: Tasks are grouped and placed into different task level sets (tls) according to esl, i.e., if a task's esl = i, it is placed in tls_i. Those tasks which have the same values of esl and lsl are called critical path tasks (cpt). Each tls_i contains at least one cpt.

4. Generate scheduled paths: All critical scheduled paths are identified before the non-critical scheduled paths; there can be several critical scheduled paths. First, we scan the task level set table from the top down (i.e., from tls_1 to tls_cpl) to select the start of a scheduled path, then identify other tasks from data dependences. If tls_i is not empty, we choose one critical path task at random. If none is available, we choose one with the smallest priority value π (which implies it belongs to a longer scheduled path), and mark it. Then we check the successors of the selected task and select an unmarked task with the smallest π. If none is available, we choose one as a pseudo-task, and check its successors until the exit tasks are reached. We mark the whole scheduled path, and then restart to generate another one until all tasks have been marked.

In this strategy, we can see that each scheduled path contains at least one task. There might be some redundant pseudo-tasks; they help to detect more tasks along the same data streams. Once a scheduled path is created, its tasks can be eliminated from the scheduling record, but not from the DAG.
Algorithm (Scheduled Path Generation)

Input: DAG G(V,E) with tls(i), 1 ≤ i ≤ cpl, and π(j), 1 ≤ j ≤ |V|
Output: List sp_1, sp_2, ..., sp_m, for some m ≥ 1

1.  m = 0
2.  for i = 1 to cpl
3.    while not all tasks in tls_i are marked "selected" do
4.      m = m + 1; sp_m = ∅
5.      select v such that π(v) = min{ π(u), ∀ u ∈ tls_i and u is unmarked }
6.      sp_m = sp_m ∪ {v}, and mark v "selected"
7.      j = i
8.      while j < cpl do
9.        if ∃ u ∈ suc(v) and u is unmarked
10.       then select unmarked x, π(x) = min{ π(y), ∀ unmarked y ∈ suc(v) }
11.         sp_m = sp_m ∪ {x}
12.         mark x "selected"
13.       else select a marked x ∈ suc(v) at random
14.       endif
15.       v = x
16.       j = j + 1
17.     endwhile
18.   endwhile
19. endfor
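The pseudocode above can be rendered as a short Python sketch. To keep it self-contained, esl, lsl, the priorities π and the task level sets are computed inline. Two details are our assumptions rather than the paper's: the random pseudo-task choice of line 13 is made deterministic for reproducibility, and we break out early when an exit task is reached before level cpl (a case the pseudocode leaves implicit):

```python
def generate_scheduled_paths(V, E):
    """Sketch of the Scheduled Path Generation algorithm of Section 2.2.
    Pseudo-tasks (already-marked successors) are traversed but not added."""
    pre = {v: set() for v in V}
    suc = {v: set() for v in V}
    for u, w in E:
        suc[u].add(w)
        pre[w].add(u)

    # esl(v) = 1 for entry tasks, else max over predecessors + 1
    esl = {}
    def get_esl(v):
        if v not in esl:
            esl[v] = 1 if not pre[v] else max(get_esl(u) for u in pre[v]) + 1
        return esl[v]
    for v in V:
        get_esl(v)
    cpl = max(esl.values())                   # critical path length

    # lsl(v) = cpl for exit tasks, else min over successors - 1
    lsl = {}
    def get_lsl(v):
        if v not in lsl:
            lsl[v] = cpl if not suc[v] else min(get_lsl(u) for u in suc[v]) - 1
        return lsl[v]
    for v in V:
        get_lsl(v)

    pri = {v: lsl[v] - esl[v] for v in V}     # priority: smaller = more critical
    tls = {}
    for v in V:
        tls.setdefault(esl[v], set()).add(v)

    marked, paths = set(), []
    for i in range(1, cpl + 1):
        while tls.get(i, set()) - marked:
            # start a new sp at the unmarked task of smallest priority value
            v = min(tls[i] - marked, key=lambda u: (pri[u], str(u)))
            sp = [v]
            marked.add(v)
            for _ in range(i, cpl):           # lines 8-17 of the pseudocode
                if not suc[v]:
                    break                     # exit task reached early
                unmarked = suc[v] - marked
                if unmarked:
                    x = min(unmarked, key=lambda u: (pri[u], str(u)))
                    sp.append(x)
                    marked.add(x)
                else:
                    x = sorted(suc[v], key=str)[0]   # pseudo-task, not added
                v = x
            paths.append(sp)
    return paths
```

On a small fork-join DAG this extracts the critical scheduled path first, then covers the remaining tasks as shorter sp's via pseudo-task traversal.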
Task time-stamping: Once we have all scheduled paths, we know which sp's the tasks belong to and their sequential orders. But we still need to assign each task a time-stamp to indicate at which step it can run on a particular PE. From the length of the csp's, we know the number of parallel execution steps. Then along each sp there exists the same number of execution time slots, and each time slot can contain only one task. Task time-stamping places each task into a reasonable time slot on its sp according to the dependences in the DAG. The esl's generated during the scheduled path generation are used for the time-stamping to fill out the time slots.

Scheduled path optimization: Up to now, an sp is a trace of data flow. If it is not contiguous, it must be sharing some tasks with others, and the shared tasks are not on the current sp. It is costly to make every sp find available tasks beyond those shared ones during the scheduled path generation period; as a result, several broken, shorter scheduled paths can be created.

If each task on sp_i is scheduled at an earlier esl than all those on sp_j, these two scheduled paths are good candidates for merging. If the last task on sp_i is connected to the first task on sp_j through some other vertices in the DAG, we call them related sp's. The benefits of merging them are:

- To reduce the number of PEs for higher efficiency.
- To make mapping easier for a non-fully connected topology, which might not be rich in communication channels. Merging scheduled paths can reduce such requests, because the shared tasks need to exchange data with other tasks on both original shorter sp's.

For general cases, any two scheduled paths whose tasks are on different lsl, i.e., with different esl, can be merged together. But this does not guarantee that they are related sp's. If they are non-related sp's, merging cannot always reduce the communication cost, because the cost depends on the underlying non-fully connected topology. This problem needs further investigation.
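The "related sp's" test above can be sketched as a simple reachability check on the DAG (the function name is ours; the precondition that every task on sp_i has an earlier esl than every task on sp_j is assumed to be checked separately):

```python
from collections import deque

def related_sps(sp_i, sp_j, E):
    """sp_i and sp_j are 'related' merge candidates when the last task of
    sp_i reaches the first task of sp_j through other vertices of the DAG."""
    suc = {}
    for u, w in E:
        suc.setdefault(u, set()).add(w)
    start, goal = sp_i[-1], sp_j[0]
    seen, queue = {start}, deque([start])
    while queue:                      # breadth-first search from start
        v = queue.popleft()
        if v == goal:
            return True
        for w in suc.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```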
3 Mapping on Target Machine
In theory, all sp's can be mapped onto any PEs of a target physical machine. In a shared memory machine model, the underlying PE network topology can be thought of as fully connected: each PE is similar to all others regarding communication links and location. In this case, mapping is easy, and any scheduled path can be mapped onto any PE.

For a distributed memory machine, communication latency plays a big role. Different mappings may result in different communication costs and lengthen the total parallel execution time. The heuristic mapping tries to map scheduled paths to nearby PEs if they share more communication links.
3.1 Detection of Scheduled Path Relationship
The communication frequency of sp_i with all other sp's is denoted by |pcs_i|. This indicates the number of links shared between sp_i and all the others. The sp with the bigger |pcs_i| should be mapped earlier than the others; this makes the mapping of subsequent sp's easier.

For sp_i and sp_j, the communication frequency between them is denoted by |pes_{i,j}|. Then sp's with bigger |pes_{i,j}| should be mapped to nearby PEs in order to reduce the communication cost between them.

During mapping, these communication frequencies are used to determine the mapping order and location.
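The two frequencies follow directly from the definitions of pcs and pes in Section 2.2; a minimal sketch (function names ours) counts the crossing edges:

```python
def pcs_size(E, sp_i):
    """|pcs_i|: number of edges between sp_i and the rest of the DAG."""
    si = set(sp_i)
    return sum(1 for (u, v) in E if (u in si) != (v in si))

def pes_size(E, sp_i, sp_j):
    """|pes_{i,j}|: number of edges crossing between sp_i and sp_j only."""
    si, sj = set(sp_i), set(sp_j)
    return sum(1 for (u, v) in E
               if (u in si and v in sj) or (v in si and u in sj))
```

For example, with the fork-join DAG of Section 2.2 split into sp_1 = (a, b, d), sp_2 = (c) and sp_3 = (e), sp_1 communicates over three edges in total, two of them with sp_2.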
3.2 Generalized Hypercube
The generalized hypercube topology has been studied in [1], [14]. Due to its high degree of node connectivity, it offers a viable alternative to shared memory and other distributed memory architectures. Many past and current massively parallel computers are based on meshes or k-ary n-cubes (e.g., Cray T3E, Intel Paragon, Tera and the design in [2]). Unlike the mesh or k-ary n-cubes, the generalized hypercubes (GH(n,k), where n = number of dimensions and k = number of nodes in each dimension) have k fully interconnected nodes in each dimension. As a result they have a very low diameter and a very high bisection width. However, the number of interconnection links in each dimension increases linearly with the number of nodes k [1].

[Figure 1: The 2-D generalized hypercube GH(2,4).]

GH is a symmetric topology: all PEs have the same number and structure of links with the others. Figure 1 illustrates the 2-D GH(n,k) with n = 2 and k = 4. The diameter of a 2-D GH(n,k) is only 2 and the bisection width is k^3/4.
3.3 Mapping on GH
We choose the 2-D generalized hypercube (GH(2,k)) as the network topology model to show how to map scheduled paths to a physical machine. Between two PEs on the same dimension of GH(2,k), each communication takes a single time unit; otherwise, it takes twice this time.

We distinguish two cases: (a) n_p (number of paths) ≤ p (number of PEs); (b) n_p > p. In case (b), several scheduled paths are mapped onto each PE, which zeroes all communication links between them, but the parallel time will be longer. In such cases, the mapping is split into global and local mappings, and each PE has several scheduled path slots. Initially, the global mapping is applied to map a scheduled path onto a PE by selecting the one with the biggest |pcs_i|. Then, for the local mapping, |pes_{i,j}| is used to find an unselected scheduled path which has the most communication links with the ones already mapped on the current PE, and map the new one to a spare slot. This local mapping process continues until the local slots are full or the locally mapped scheduled paths have no communication need with any others.

Because of the peculiarity of GH(2,k), the global mapping strategy can be formulated as the following algorithm, whereas the local mapping remains the same. To illustrate the global mapping algorithm, some terminology for GH(2,k) is introduced.

Link counters for rows (or columns): For each newly selected sp_i, each GH row (or column) j maintains a parameter lcr_i[j] (or lcc_i[j]) for the total number of links between sp_i and all scheduled paths which have been assigned to that row (or column):

lcr_i[j] = Σ_{∀ sp_l mapped on row j} |pes_{i,l}|

lcc_i[j] = Σ_{∀ sp_l mapped on column j} |pes_{i,l}|
global mapping
strategy on the2-D GH.1. Sort
sp
0
s
decreasingly according to
j
pcs
i
j
.2. Select an unselected
sp
i
with the biggest
j
pcs
i
j
.3. Sort GH rows and columns according to
lcr
i
and
lcc
i
.4. Find an available GH node
(
m
0
;n
0
)
such that
lcr
i
m
0
]+
lcc
i
n
0
]=max
f
lcr
i
m
]+
lcc
i
n
]
, for anyGH node
(
m;n
)
g
5. Map the selected
sp
i
to the GH node
(
m
0
;n
0
)
.6. Activate the local mapping process to map more
sp
0
s
to available local slots.7. Repeat step 2, 3, 4, and 6 until all
sp
0
s
have beenmapped.
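The global mapping steps above can be sketched as follows, for the case n_p ≤ k² (all names are ours; the local slot handling of step 6 is omitted, and ties are broken deterministically rather than by sorting rows and columns separately). The matrix pes[i][l] stands for |pes_{i,l}|:

```python
def global_mapping(n_paths, k, pes):
    """Heuristic global mapping of scheduled paths onto GH(2,k), Section 3.3.
    Assumes n_paths <= k*k; local mapping to spare slots is not modeled."""
    pcs = [sum(pes[i]) for i in range(n_paths)]          # |pcs_i| per path
    order = sorted(range(n_paths), key=lambda i: -pcs[i])  # step 1
    placed = {}                                          # sp index -> (row, col)
    free = {(r, c) for r in range(k) for c in range(k)}
    for i in order:                                      # step 2
        # link counters lcr_i / lcc_i against already-placed paths (step 3)
        lcr, lcc = [0] * k, [0] * k
        for l, (r, c) in placed.items():
            lcr[r] += pes[i][l]
            lcc[c] += pes[i][l]
        # step 4: free node maximizing lcr[row] + lcc[col]
        node = max(sorted(free), key=lambda rc: lcr[rc[0]] + lcc[rc[1]])
        placed[i] = node                                 # step 5
        free.remove(node)
    return placed
```

With three paths where sp_0 talks heavily to sp_1 and lightly to sp_2, the sketch places sp_1 in the same row and sp_2 in the same column as sp_0, so every communicating pair stays one hop apart.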
4 Simulated Experiments
In this paper, the parallel computation time is determined by the critical path length, unless the target machine runs out of PEs. The performance of the algorithms is evaluated by the number of inter-communication links (nicl) among tasks on PEs. For a fully connected topology, nicl is just the number of links among all scheduled paths. For a real target machine, nicl is the summation of communication costs among the sp's on the PEs, expressed as the sum of the total numbers of hops.

For the 2-D GH, the PE nodes are fully connected along each dimension. Any communication along the same dimension has weight 1; other communication, between two scheduled paths in different rows and columns, has weight 2. The summation of these weights indicates the total communication cost.

Next, free scheduling, refined free scheduling and path-driven scheduling are applied to a simple nested loop example, the matrix multiplication and the Jacobi loops. The comparison of nicl's is reported on the fully connected topology and the 2-D GH.
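The cost measure just described can be sketched as follows (helper names ours; placed maps each sp index to its GH node, and pes[i][j] stands for |pes_{i,j}|):

```python
def hop_weight(a, b):
    """Hop cost on GH(2,k): 0 on the same PE, 1 when the two PEs share a
    row or a column (one fully connected dimension), 2 otherwise."""
    if a == b:
        return 0
    return 1 if a[0] == b[0] or a[1] == b[1] else 2

def nicl(placed, pes):
    """Total communication cost: each cross-path link weighted by hops."""
    ids = list(placed)
    total = 0
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            i, j = ids[x], ids[y]
            total += pes[i][j] * hop_weight(placed[i], placed[j])
    return total
```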