Static vs. Dynamic List-Scheduling Performance Comparison

T. Hagras, J. Janeček

The problem of efficient task scheduling is one of the most important and most difficult issues in homogeneous computing environments. Finding an optimal solution for a scheduling problem is NP-complete. Therefore, it is necessary to have heuristics to find a reasonably good schedule rather than evaluate all possible schedules. List-scheduling is generally accepted as an attractive approach, since it pairs low complexity with good results. List-scheduling algorithms schedule tasks in order of priority, and this priority can be computed either statically (before scheduling) or dynamically (during scheduling). This paper presents the characteristics of the two main static and the two main dynamic list-scheduling algorithms, and compares their performance on randomly generated graphs with various characteristics.

Keywords: list scheduling, compile time scheduling, task graph scheduling, homogeneous computing.

1 Introduction

Efficient scheduling of computationally intensive programs is one of the most essential and most difficult issues in achieving high performance in a homogeneous computing environment [1]. When the characteristics of an application are known a priori, including task execution times, the size of the data communicated between tasks, and task dependencies, the application is represented by a static model [2]. In the static model, the application is represented by a directed acyclic graph (DAG) in which the nodes represent the application tasks and the edges represent inter-task data dependencies. Each node is labeled with the computation cost (expected computation time) of the task and each edge is labeled with the communication cost (expected communication time) between tasks [3, 4, 5, 6, 7].

The objective of scheduling is to map the tasks onto the processors (machines) and order their execution so that task dependencies are satisfied and the minimum overall schedule length (makespan) is achieved. Finding an optimal solution for the scheduling problem is NP-complete [3, 4, 5, 6, 7]. Therefore, it is necessary to have heuristics that find a good schedule rather than evaluate all possible scheduling combinations.

Most scheduling heuristics are based on list-scheduling [2, 3, 4, 7]. List-scheduling consists of two phases: a task prioritizing phase, where a priority is computed and assigned to each node of the DAG, and a processor selection phase, where each task (in order of its priority) is assigned to the processor that minimizes a suitable cost function. The scheduling heuristic is called static if the processor selection phase starts after completion of the task prioritizing phase [2, 8], and it is called dynamic if the two phases are interleaved [9, 10].

This paper presents the characteristics of the two main static and the two main dynamic list-scheduling algorithms. It also compares their performance over 90,000 randomly generated graphs with various characteristics. The remainder of this paper is organized as follows. The next section defines the static task-scheduling problem and gives the background of the problem, including some definitions and parameters used in the algorithms. Section 3 presents a brief review of the examined algorithms. Section 4 presents a performance comparison of the reviewed algorithms. Section 5 provides the conclusion.

2 Task Scheduling Problem

This section presents the application model used for static task scheduling and the homogeneous computing environment model assumed by the surveyed algorithms.

The application can be represented by a directed acyclic graph G(V, E, C, W), where:
V is the set of v nodes; each node v_i ∈ V represents an application task, which is a sequence of instructions that must be executed serially on the same processor,
W is the set of computation costs, where w_i ∈ W is the execution time of task v_i,
E is the set of communication edges; the directed edge e_{i,j} joins nodes v_i and v_j, where node v_i is called the parent node and node v_j is called the child node; this also implies that v_j cannot start until v_i finishes and sends its data to v_j,
C is the set of communication costs; edge e_{i,j} has a communication cost c_{i,j} ∈ C.

A task without any parent is called an entry task and a task without any child is called an exit task. If there is more than one exit (entry) task, they may be connected to a zero-cost pseudo exit (entry) task with zero-cost edges, which does not affect the schedule.
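As a concrete illustration of this model, the following minimal Python sketch encodes a small hypothetical DAG (the task names, weights and edge costs are ours for illustration; this is not the graph of Fig. 1):

```python
# Hypothetical four-task application DAG: W maps each task to its
# computation cost w_i, C maps each edge (parent, child) to its
# communication cost c_{i,j}. "v1" is the entry task, "v4" the exit task.
W = {"v1": 4, "v2": 3, "v3": 2, "v4": 5}
C = {("v1", "v2"): 6, ("v1", "v3"): 1, ("v2", "v4"): 2, ("v3", "v4"): 7}

def pred(v):
    """Immediate predecessors (parents) of task v."""
    return [a for (a, b) in C if b == v]

def succ(v):
    """Immediate successors (children) of task v."""
    return [b for (a, b) in C if a == v]
```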
The homogeneous computing environment model is a set P of p identical processors connected in a fully connected topology. It is also assumed that:
- any processor can execute a task and communicate with other processors at the same time,
- once a processor has started executing a task, it continues without interruption, and on completing the execution it immediately sends the output data to all children tasks in parallel.

The communication cost c_{i,j} for transferring data from task v_i (scheduled on p_m) to task v_j (scheduled on p_n) is defined as

c_{i,j} = S + \mu_{i,j} \cdot R,

where S is the cost of starting communication between processors (in seconds), \mu_{i,j} is the amount of data transmitted from task v_i to task v_j (in bytes), and R is the communication cost per transferred byte (in sec/byte). It is assumed that the startup cost S is negligible and that the unit cost R is the same for any pair of processors, so the communication cost between any two tasks is a function of the amount of transferred data only.

2.1 Basic Scheduling Attributes

The attributes most frequently used for assigning priorities in list-scheduling are the t-level (top level) and the b-level (bottom level). The t-level of a node v_i is the length of the longest path from the entry node to v_i (excluding v_i), where the length of a path is the sum of all node and edge weights along it. The t-level(v_i) is computed recursively by traversing the DAG downward, starting from the entry node v_{entry}:

t\text{-}level(v_i) = \max_{v_m \in pred(v_i)} \{ t\text{-}level(v_m) + w_m + c_{m,i} \},

where pred(v_i) is the set of immediate predecessors of v_i and t-level(v_{entry}) = 0.

The b-level(v_i) of a node v_i is the length of the longest path from v_i to the exit node. It is computed recursively by traversing the DAG upward, starting from the exit node v_{exit}:

b\text{-}level(v_i) = w_i + \max_{v_m \in succ(v_i)} \{ b\text{-}level(v_m) + c_{i,m} \},

where succ(v_i) is the set of immediate successors of v_i and b-level(v_{exit}) = w_{exit}.

If the edge weights are not taken into account when computing the b-level, it is called the static b-level, or simply the static level (SL). The SL is computed recursively by traversing the DAG upward, starting from the exit node v_{exit}:

SL(v_i) = w_i + \max_{v_m \in succ(v_i)} SL(v_m),

where succ(v_i) is again the set of immediate successors of v_i and SL(v_{exit}) = w_{exit}.
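Using the hypothetical encoding from the sketch above, the three recurrences can be implemented directly; this is an illustrative memoized version, not the authors' implementation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def t_level(v):
    ps = pred(v)
    if not ps:                          # entry node: t-level = 0
        return 0
    return max(t_level(m) + W[m] + C[(m, v)] for m in ps)

@lru_cache(maxsize=None)
def b_level(v):
    ss = succ(v)
    if not ss:                          # exit node: b-level = its own weight
        return W[v]
    return W[v] + max(b_level(m) + C[(v, m)] for m in ss)

@lru_cache(maxsize=None)
def static_level(v):
    ss = succ(v)
    if not ss:
        return W[v]
    return W[v] + max(static_level(m) for m in ss)   # edge weights ignored
```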
Two other attributes are also used to assign priorities to the nodes: the EST (Earliest Start Time), also called ASAP (As Soon As Possible), and the LST (Latest Start Time), also called ALAP (As Late As Possible).

The EST(v_i) of v_i is highly correlated with the t-level of v_i, and the procedure for computing the t-level can be used to compute the nodes' earliest start times. The EST is computed recursively by traversing the DAG downward, starting from the entry node v_{entry}:

EST(v_i) = \max_{v_m \in pred(v_i)} \{ EST(v_m) + w_m + c_{m,i} \},

where pred(v_i) is the set of immediate predecessors of v_i and EST(v_{entry}) = 0.

The LST is computed recursively by traversing the DAG upward, starting from the exit node v_{exit}:

LST(v_i) = \min_{v_m \in succ(v_i)} \{ LST(v_m) - c_{i,m} \} - w_i,

where succ(v_i) is the set of immediate successors of v_i and LST(v_{exit}) = EST(v_{exit}).

The critical path (CP) of a DAG is the longest path from the entry node to the exit node; clearly, a DAG can have more than one CP. Consider the DAG shown in Fig. 1, where each node carries two labels: the upper one gives the node number and the lower one gives the node weight. In this DAG the nodes v_1, v_2, v_9 and v_10 are the CP nodes, called CPNs (Critical Path Nodes), and the edges on the CP are drawn as thick arrows. The values of the priority attributes discussed above are given in Table 1.

Fig. 1: Application directed acyclic graph (DAG)

Table 1: Priority attributes

Node   SL   t-level & EST   b-level   LST
  1    80         0           104       0
  2    60        28            76      28
  3    50        24            60      44
  4    55        22            65      39
  5    45        28            61      43
  6    40        24            52      52
  7    30        52            32      72
  8    35        52            39      65
  9    40        56            48      56
 10    20        84            20      84
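EST coincides with the t-level recurrence just given, so only the LST needs separate code. The following continues the hypothetical sketch from Section 2 (our illustration, not the authors' implementation). A useful consequence of the two recurrences, handy for checking Table 1, is that LST(v_i) = T - b-level(v_i), where T = b-level(v_{entry}) is the critical-path length; for node 4 of Table 1, 104 - 65 = 39.

```python
# Illustrative LST (ALAP) implementation following the recurrence above;
# continues the earlier hypothetical DAG sketch.

@lru_cache(maxsize=None)
def lst(v):
    ss = succ(v)
    if not ss:                      # exit node: LST(v_exit) = EST(v_exit)
        return t_level(v)           # EST equals the t-level here
    return min(lst(m) - C[(v, m)] for m in ss) - W[v]

# Spot check of the relation noted above, with "v1" the entry node:
# lst(v) == b_level("v1") - b_level(v) holds for every node v.
```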
3 List-Scheduling Algorithms

This section presents the two main static and the two main dynamic list-scheduling algorithms. All of these algorithms target a limited number of homogeneous processors.

3.1 Static List-Scheduling Algorithms

This section briefly reviews the two main static list-scheduling algorithms: the Highest Level First with Estimated Time (HLFET) algorithm [2] and the Modified Critical Path (MCP) algorithm [8].

3.1.1 HLFET Algorithm

The Highest Level First with Estimated Time (HLFET) algorithm [2] is one of the simplest list-scheduling algorithms. It is described in Fig. 2.

Fig. 2: HLFET algorithm
1. Compute the SL (static level) of each node in the graph.
2. Put all nodes in a list L and sort L in descending order of SL.
3. while L is not empty do
   - dequeue v_i from L,
   - compute the earliest execution start time of v_i on every processor,
   - schedule v_i to the processor that minimizes its earliest execution start time.

The complexity of the HLFET algorithm is O(pv^2). For the DAG shown in Fig. 1, the scheduling trace of the HLFET algorithm is given in Table 2. The table lists, at each step, the execution start time of the selected node on every available processor; the nodes on the list are scheduled one by one to the processor that allows the earliest execution start time.

Table 2: A scheduling trace of the HLFET algorithm (makespan = 88)

Step   Selected v   p1   p2   p3   Selected p
  1         1        0    0    0      p1
  2         2       20   28   28      p1
  3         4       40   22   22      p2
  4         3       40   37   24      p3
  5         5       40   37   44      p2
  6         6       40   42   44      p1
  7         9       50   48   50      p2
  8         8       45   68   53      p1
  9         7       60   68   44      p3
 10        10       76   68   76      p2

3.1.2 MCP Algorithm

The Modified Critical Path (MCP) algorithm [8] uses the ALAP attribute (the LST defined in Section 2) of a node as its scheduling priority. The MCP algorithm first computes the ALAP of every node and then constructs a list of nodes in ascending order of ALAP; equal ALAP values are resolved by taking the ALAPs of the children into consideration. The MCP algorithm then schedules the nodes on the list one by one, assigning each node to the processor that allows the earliest execution start time. The MCP algorithm is shown in Fig. 3.

Fig. 3: MCP algorithm
1. Compute the ALAP (the LST of Section 2) of each node in the graph.
2. For each node, create a list consisting of the ALAP of the node itself and of all its children.
3. Sort these lists in ascending order of ALAP.
4. Create a node list L sorted in ascending order of ALAP, using the per-node lists of the previous two steps to break ties.
5. while L is not empty do
   - dequeue v_i from L,
   - compute the earliest execution start time of v_i on every processor,
   - schedule v_i to the processor that minimizes its earliest execution start time.

The complexity of the MCP algorithm is O(pv^2). For the DAG shown in Fig. 1, the scheduling trace of the MCP algorithm is given in Table 3.

Table 3: A scheduling trace of the MCP algorithm (makespan = 85)

Step   Selected v   p1   p2   p3   Selected p
  1         1        0    0    0      p1
  2         2       20   28   28      p1
  3         4       40   22   22      p2
  4         5       40   37   28      p3
  5         3       40   37   33      p3
  6         6       40   37   53      p2
  7         9       41   48   53      p1
  8         8       61   44   53      p2
  9         7       61   61   53      p3
 10        10       65   69   69      p1
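HLFET and MCP differ only in how the list L is ordered; the processor-selection phase is the same. The following is a hedged sketch of that shared phase, continuing the earlier hypothetical encoding (our reconstruction, not the authors' code). Both priority orders place every parent before its children, so the finish times used below are always available.

```python
# Shared processor-selection phase of HLFET and MCP (illustrative sketch).
# priority_order is the list L: descending SL for HLFET, ascending ALAP
# (LST) for MCP.

def list_schedule(priority_order, num_procs):
    proc_free = [0] * num_procs              # time each processor becomes free
    finish, placed_on = {}, {}

    def eest(v, p):
        """Earliest execution start time of v on processor p."""
        t = proc_free[p]
        for m in pred(v):                    # wait for data from every parent
            t = max(t, finish[m] + (0 if placed_on[m] == p else C[(m, v)]))
        return t

    for v in priority_order:
        p = min(range(num_procs), key=lambda q: eest(v, q))
        finish[v] = eest(v, p) + W[v]
        placed_on[v] = p
        proc_free[p] = finish[v]
    return max(finish.values())              # the makespan

# HLFET on the hypothetical DAG and three processors:
# list_schedule(sorted(W, key=static_level, reverse=True), 3)
```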
3.2 Dynamic List-Scheduling Algorithms

This section briefly reviews the two main dynamic list-scheduling algorithms: the Earliest Time First (ETF) algorithm [9] and the Dynamic Level Scheduling (DLS) algorithm [10].

3.2.1 ETF Algorithm

The Earliest Time First (ETF) algorithm [9] computes, at each step, the earliest execution start time (EEST) of every ready node and selects the node with the lowest value for scheduling. A ready node is a node all of whose parents have been scheduled. When two nodes have the same EEST, the ETF algorithm breaks the tie by scheduling the one with the higher static level. Fig. 4 shows the ETF algorithm.

Fig. 4: ETF algorithm
1. Compute the SL (static level) of each node in the graph.
2. Initially, the ready list contains only the entry node.
3. while the ready list is not empty do
   - compute the earliest execution start time on each processor for each node in the ready list,
   - select the node-processor pair that gives the earliest execution start time, breaking ties in favor of the node with the higher SL,
   - schedule the node to the selected processor,
   - add the newly ready nodes to the ready list.

The complexity of the ETF algorithm is O(pv^3). For the DAG shown in Fig. 1, the scheduling trace of the ETF algorithm is given in Table 4.

Table 4: A scheduling trace of the ETF and DLS algorithms (makespan = 88)

Step   Selected v   p1   p2   p3   Selected p
  1         1        0    0    0      p1
  2         2       20   28   28      p1
  3         4       40   22   22      p2
  4         3       40   37   24      p3
  5         5       40   37   44      p2
  6         6       40   42   44      p1
  7         7       52   52   44      p3
  8         8       45   53   53      p1
  9         9       50   48   50      p2
 10        10       76   68   76      p2

3.2.2 DLS Algorithm

The Dynamic Level Scheduling (DLS) algorithm [10] uses an attribute called the dynamic level (DL), which is the difference between the static level of a node and its earliest execution start time. In each scheduling step, the node-processor pair that gives the largest DL is selected. This mechanism is similar to the one used by the ETF algorithm, with one subtle difference: the ETF algorithm schedules the node with the minimum earliest execution start time and uses the static level merely to break ties, whereas the DLS algorithm tends to schedule nodes in descending order of their static levels at the beginning of the process and in ascending order of their EESTs near its end. The DLS algorithm is shown in Fig. 5.

Fig. 5: DLS algorithm
1. Compute the SL (static level) of each node in the graph.
2. Initially, the ready list contains only the entry node.
3. while the ready list is not empty do
   - compute the earliest execution start time of every ready node on each processor,
   - compute the DL of every node-processor pair by subtracting the earliest execution start time from the node's static level (SL),
   - select the node-processor pair that gives the largest DL,
   - schedule the node to the selected processor,
   - add the newly ready nodes to the ready list.

The complexity of the DLS algorithm is O(pv^3). For the DAG shown in Fig. 1, the scheduling trace of the DLS algorithm is exactly the same as that of the ETF algorithm, shown in Table 4.
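Both dynamic algorithms maintain a ready list and pick one node-processor pair per step; only the selection rule differs. The sketch below continues the hypothetical encoding used in the earlier blocks (our reconstruction, not the authors' code); mode switches between the ETF rule (minimum EEST, higher SL breaking ties) and the DLS rule (maximum DL = SL - EEST).

```python
# Dynamic selection loop shared by ETF and DLS (illustrative sketch).

def dynamic_schedule(num_procs, mode="ETF"):
    proc_free = [0] * num_procs
    finish, placed_on = {}, {}

    def eest(v, p):
        t = proc_free[p]
        for m in pred(v):                    # wait for data from every parent
            t = max(t, finish[m] + (0 if placed_on[m] == p else C[(m, v)]))
        return t

    ready = {v for v in W if not pred(v)}    # initially the entry node(s)
    while ready:
        pairs = [(v, p) for v in ready for p in range(num_procs)]
        if mode == "ETF":                    # min EEST; higher SL breaks ties
            v, p = min(pairs, key=lambda vp: (eest(*vp), -static_level(vp[0])))
        else:                                # DLS: max DL = SL - EEST
            v, p = max(pairs, key=lambda vp: static_level(vp[0]) - eest(*vp))
        finish[v] = eest(v, p) + W[v]
        placed_on[v] = p
        proc_free[p] = finish[v]
        ready.discard(v)
        ready |= {s for s in succ(v)         # children whose parents are done
                  if all(m in finish for m in pred(s))}
    return max(finish.values())              # the makespan
```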
4 Experimental Results and Discussion

This section presents a performance comparison of the four algorithms reviewed in Section 3. For this purpose, randomly generated task graphs and the following comparison metrics were used.

4.1 Comparison Metrics

The comparison of the algorithms is based on the following metrics.

Makespan: the overall completion time of the schedule,

makespan = FET(v_{exit}),

where FET(v_{exit}) is the finishing time of the scheduled exit node.

Scheduling Length Ratio (SLR): the main performance measure of a scheduling algorithm is the schedule length (makespan) of its output schedule. Since a large set of task graphs with different properties is used, it is necessary to normalize the schedule length to a lower bound; the normalized value is called the Schedule Length Ratio (SLR). The SLR of an algorithm on a graph is defined as

SLR = makespan / \sum_{v_i \in CP} w_i.

The denominator is the sum of the computation costs of the tasks on a critical path (CP) and is a lower bound on the schedule length, so the SLR of a graph cannot be less than one for any algorithm. Average SLR values are used in our experiments.

Speedup: the speedup is computed by dividing the sequential execution time (i.e., the cumulative computation cost of all tasks) by the parallel execution time (i.e., the makespan of the schedule).

Number of occurrences of better schedules: the number of times each algorithm produced a better, worse, or equal schedule compared to every other algorithm is counted in the experiments.

4.2 Random Graph Generator

A random graph generator was implemented to produce weighted application DAGs with various characteristics, controlled by the following input parameters:
- the number of tasks in the graph, v,
- the number of graph levels, l,
- the communication-to-computation ratio, CCR, defined as the ratio of the average communication cost to the average computation cost.

In all experiments, only graphs with a single entry node and a single exit node were considered. In each experiment, the parameter values were selected from the following sets:

v \in {20, 30, 40, 50, 60, 70, 80, 90, 100},
l \in [0.2 v, 0.8 v],
CCR \in {0.5, 1.0, 2.0}.

These input parameters were used to generate 10,000 different DAGs with various characteristics for each value of v.

4.3 Performance Results

The performance of the algorithms was compared with respect to graph size, repeating the experiment for each v in the set given above. For each v, 10,000 graphs were generated, selecting CCR and the number of levels l randomly from the values given above for each graph.

The average SLR for each v is given in Fig. 6. In general, the dynamic algorithms perform better than the static ones; the two static algorithms perform nearly equally, while among the dynamic algorithms ETF is better than DLS.

Fig. 6: Average SLR

The average speedup is given in Fig. 7. In general, the speedup of the dynamic algorithms is better than that of the static ones; the two dynamic algorithms perform almost equally, while HLFET is better than MCP.

Fig. 7: Average speedup

Finally, the percentage of cases in which each scheduling algorithm produced a better (B), equal (E), or worse (W) schedule length than every other algorithm was counted over the 90,000 DAGs used. Each cell of Table 5 compares the algorithm on the left with the algorithm at the top.

Table 5: Pairwise comparison of the examined algorithms

                 vs. HLFET   vs. MCP    vs. ETF    vs. DLS
HLFET   B            -       49.15 %    37.75 %    22.62 %
        E            -       12.47 %    22.37 %    42.17 %
        W            -       38.39 %    39.89 %    35.22 %
MCP     B        38.39 %        -       40.84 %    36.41 %
        E        12.47 %        -        5.84 %     8.35 %
        W        49.15 %        -       53.33 %    55.24 %
ETF     B        39.89 %     53.33 %       -       26.93 %
        E        22.37 %      5.84 %       -       39.53 %
        W        37.75 %     40.84 %       -       33.54 %
DLS     B        35.22 %     55.24 %    33.54 %       -
        E        42.17 %      8.35 %    39.53 %       -
        W        22.62 %     36.41 %    26.93 %       -

Table 5 indicates that the dynamic algorithms are better than the static ones. Among the static algorithms HLFET is better than MCP, while among the dynamic algorithms DLS is better than ETF.

We note that algorithm complexity is an important factor that has to be taken into account when comparing the performance of different algorithms. As shown in Table 6, the complexity of the dynamic algorithms is much higher than that of the static ones.

Table 6: Algorithm complexity

Algorithm   Complexity
HLFET       O(pv^2)
MCP         O(pv^2)
ETF         O(pv^3)
DLS         O(pv^3)

It is therefore unfair to count a case in which a static algorithm produces the same schedule length as a dynamic one as an equivalent trial. If an equal schedule length produced by two algorithms of different complexity is counted as a win for the lower-complexity algorithm, the better/equal/worse comparison of the examined algorithms becomes the one shown in Table 7.

Table 7: Pairwise comparison of the examined algorithms with complexity taken into account

                 vs. HLFET   vs. MCP    vs. ETF    vs. DLS
HLFET   B            -       49.15 %    60.12 %    64.79 %
        E            -       12.47 %       -          -
        W            -       38.39 %    39.89 %    35.22 %
MCP     B        38.39 %        -       46.68 %    44.76 %
        E        12.47 %        -          -          -
        W        49.15 %        -       53.33 %    55.24 %
ETF     B        39.89 %     53.33 %       -       26.93 %
        E            -          -          -       39.53 %
        W        60.12 %     46.68 %       -       33.54 %
DLS     B        35.22 %     55.24 %    33.54 %       -
        E            -          -       39.53 %       -
        W        64.79 %     44.76 %    26.93 %       -
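The entries of Table 7 follow mechanically from Table 5 under the rule just described: a tie between algorithms of different complexity classes is credited to the cheaper one. A small sketch of this adjustment (our illustration):

```python
# Recomputing a Table 7 cell from a Table 5 cell. Ties between algorithms of
# the same complexity class are kept; ties across classes are credited to the
# lower-complexity (static) algorithm.

ORDER = {"HLFET": 2, "MCP": 2, "ETF": 3, "DLS": 3}   # exponent k in O(p v^k)

def adjust(b, e, w, a, other):
    """Complexity-adjusted (B, E, W) percentages of algorithm a vs. other."""
    if ORDER[a] == ORDER[other]:
        return b, e, w                   # same class: ties remain ties
    if ORDER[a] < ORDER[other]:
        return b + e, 0.0, w             # a is cheaper: ties count as wins
    return b, 0.0, w + e                 # a is costlier: ties count as losses

# Table 5, HLFET vs. ETF: B=37.75, E=22.37, W=39.89
print(adjust(37.75, 22.37, 39.89, "HLFET", "ETF"))
# -> (60.12, 0.0, 39.89) up to float rounding, matching Table 7
```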
4.4 Ranking of Examined Algorithms

Based on the above comparison metrics and the average results over the 90,000 randomly generated DAGs, the examined algorithms rank as follows, from best to worst:

average makespan: DLS, ETF, HLFET, MCP,
average speedup: DLS, ETF, HLFET, MCP,
average SLR: ETF, DLS, MCP, HLFET,
best results: DLS, ETF, MCP, HLFET,
complexity: HLFET and MCP, then DLS and ETF.

5 Conclusion

In this paper we have presented a brief description of the characteristics of the two best-known static and the two best-known dynamic list-scheduling algorithms. The performance of these algorithms was examined on randomly generated graphs with various characteristics, using several comparison metrics. In general, the dynamic list-scheduling algorithms performed better than the static ones. Among the static list-scheduling algorithms HLFET performed better than MCP, and among the dynamic list-scheduling algorithms the DLS algorithm performed better than the ETF algorithm. If the complexities of the algorithms are taken into account, however, static list-scheduling is highly recommended.

References

[1] Feitelson, D., Rudolph, L., Schwiegelshohn, U., Sevcik, K., Wong, P.: Theory and Practice in Parallel Job Scheduling. JSSPP, 1997, p. 1–34.
[2] Kwok, Y., Ahmed, I.: Benchmarking the Task Graph Scheduling Algorithms. Proc. IPPS/SPDP, 1998.
[3] Liou, J., Palis, M.: A Comparison of General Approaches to Multiprocessor Scheduling. Proc. Int'l Parallel Processing Symp., 1997, p. 152–156.
[4] Khan, A., McCreary, C., Jones, M.: A Comparison of Multiprocessor Scheduling Heuristics. ICPP, 1994, Vol. 2, p. 243–250.
[5] Gerasoulis, A., Yang, T.: A Comparison of Clustering Heuristics for Scheduling DAGs on Multiprocessors. Journal of Parallel and Distributed Computing, 1992, Vol. 16, p. 276–291.
[6] Gerasoulis, A., Yang, T.: On the Granularity and Clustering of Directed Acyclic Task Graphs. IEEE Trans. Parallel and Distributed Systems, 1993, Vol. 4, No. 6, p. 686–701.
[7] Zhou, H.: Scheduling DAGs on a Bounded Number of Processors. Int'l Conf. on Parallel and Distributed Processing Techniques and Applications, 1996.
[8] Min-You, W., Gajski, D.: Hypertool: A Programming Aid for Message-Passing Systems. IEEE Trans. Parallel and Distributed Systems, 1990, Vol. 1, No. 3.
[9] Hwang, J., Chow, Y., Anger, E., Lee, C.: Scheduling Precedence Graphs in Systems with Interprocessor Communication Times. SIAM Journal on Computing, 1989, Vol. 18, No. 2, p. 244–257.
[10] Sih, G., Lee, E.: A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures. IEEE Trans. Parallel and Distributed Systems, 1993, Vol. 4, No. 2, p. 75–87.

Ing. Tarek Hagras
phone: +420 224 357 267
e-mail: tarek@felk.cvut.cz

Doc. Ing. Jan Janeček, CSc.
phone: +420 224 357 267
e-mail: janecek@fel.cvut.cz

Dept. of Computer Science and Engineering
Czech Technical University in Prague
Faculty of Electrical Engineering
Karlovo nám. 13
121 35 Prague 2, Czech Republic