Hadoop job scheduling with dynamic task splitting

Job scheduling affects the fairness and performance of shared Hadoop clusters. Fairness measures how fairly the resources in the cluster are shared among its users. In Hadoop, schedulers will always attempt to maximize data locality. Data locality refers to the processing o...

Bibliographic Details
Main Author: Xu, Yongliang
Other Authors: Cai Wentong
Format: Theses and Dissertations
Language: English
Published: 2015
Online Access: https://hdl.handle.net/10356/65309
Institution: Nanyang Technological University
id sg-ntu-dr.10356-65309
record_format dspace
spelling sg-ntu-dr.10356-65309 2023-03-04T00:42:53Z Hadoop job scheduling with dynamic task splitting Xu, Yongliang Cai Wentong School of Computer Engineering Parallel and Distributed Computing Centre DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications Job scheduling affects the fairness and performance of shared Hadoop clusters. Fairness measures how fairly the resources in the cluster are shared among its users. In Hadoop, schedulers will always attempt to maximize data locality. Data locality refers to the processing of data by tasks on the nodes where the data is stored. Processing data on data-local nodes improves performance, as there is no need to transfer data from one node to another. However, fairness and data locality are often in conflict. During scheduling, the available nodes do not always contain the data that a user’s job requires. In such cases, a scheduler may choose to schedule the tasks on these nodes regardless of data locality, thus sacrificing performance. Alternatively, a scheduler may choose to give up the user’s slot and wait for a data-local node, thus sacrificing fairness. Achieving pure fairness may compromise the data locality of the tasks, which in turn negatively affects performance, and vice versa. Delay scheduling is a technique that attempts to improve data locality by waiting for a data-local node to become available, but in doing so it violates the fairness criteria. The Dynamic Task Splitting Scheduler (DTSS) is proposed to mitigate the tradeoff between fairness and data locality during job scheduling. DTSS does so by dynamically splitting a task and executing the split task immediately on a non-data-local node to improve fairness. Analysis and experimental results show that it is possible to improve both fairness and performance by adjusting the proportion of the task that is split. DTSS is shown to improve the makespan of different users in a cluster by 2% to 11% compared to delay scheduling under conditions in which data-local nodes are difficult to obtain. Lastly, experiments show that DTSS is not a suitable scheduler under conditions where jobs are able to obtain data-local nodes easily. MASTER OF ENGINEERING (SCE) 2015-07-16T03:45:23Z 2015-07-16T03:45:23Z 2015 2015 Thesis Xu, Y. (2015). Hadoop job scheduling with dynamic task splitting. Master’s thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/65309 10.32657/10356/65309 en 68 p. application/pdf
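The tradeoff described in the abstract can be illustrated with a toy model. The sketch below is not the thesis's DTSS implementation or its analytical model; it is a minimal illustration under stated assumptions. All names (`Task`, `finish_time`, `best_split`) and all rates are hypothetical. It assumes a fraction of a task's input runs immediately on a non-data-local node (paying a transfer cost first), while the remainder waits a known time for a data-local node. It ignores splitting overhead and slot contention, so it does not reproduce the thesis's finding that DTSS is unsuitable when data-local nodes are easy to obtain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description; every field is an assumed parameter."""
    input_mb: float          # size of the task's input
    process_mb_per_s: float  # processing rate, assumed equal on all nodes
    network_mb_per_s: float  # rate at which a non-local node fetches data


def local_time(task: Task, mb: float) -> float:
    """Time to process `mb` of input on a data-local node (no transfer)."""
    return mb / task.process_mb_per_s


def remote_time(task: Task, mb: float) -> float:
    """Time on a non-data-local node: transfer first, then process.

    Assumes transfer and processing do not overlap.
    """
    return mb / task.network_mb_per_s + mb / task.process_mb_per_s


def finish_time(task: Task, split_fraction: float, wait_s: float) -> float:
    """Completion time when `split_fraction` of the input starts now on a
    non-local node and the rest starts after `wait_s` on a data-local node.

    split_fraction = 0.0 models waiting for a data-local node (delay
    scheduling); split_fraction = 1.0 models running the whole task
    non-locally right away.
    """
    if not 0.0 <= split_fraction <= 1.0:
        raise ValueError("split_fraction must be in [0, 1]")
    remote_mb = task.input_mb * split_fraction
    local_mb = task.input_mb - remote_mb
    remote_done = remote_time(task, remote_mb) if remote_mb > 0 else 0.0
    local_done = wait_s + local_time(task, local_mb) if local_mb > 0 else 0.0
    # The task is finished only when both parts are finished.
    return max(remote_done, local_done)


def best_split(task: Task, wait_s: float, steps: int = 100) -> float:
    """Grid search for the split fraction with the smallest finish time."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda p: finish_time(task, p, wait_s))


if __name__ == "__main__":
    task = Task(input_mb=128, process_mb_per_s=16, network_mb_per_s=32)
    wait_s = 6.0  # assumed wait until a data-local node frees up
    p = best_split(task, wait_s)
    print("wait for local node:", finish_time(task, 0.0, wait_s))
    print("run fully non-local:", finish_time(task, 1.0, wait_s))
    print("split fraction", p, "->", finish_time(task, p, wait_s))
```

In this toy setting, an intermediate split fraction finishes earlier than either extreme because the two parts run in parallel, which is the intuition behind adjusting the proportion of the task that is split.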
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications
spellingShingle DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications
Xu, Yongliang
Hadoop job scheduling with dynamic task splitting
author2 Cai Wentong
author_facet Cai Wentong
Xu, Yongliang
format Theses and Dissertations
author Xu, Yongliang
author_sort Xu, Yongliang
title Hadoop job scheduling with dynamic task splitting
title_short Hadoop job scheduling with dynamic task splitting
title_full Hadoop job scheduling with dynamic task splitting
title_fullStr Hadoop job scheduling with dynamic task splitting
title_full_unstemmed Hadoop job scheduling with dynamic task splitting
title_sort hadoop job scheduling with dynamic task splitting
publishDate 2015
url https://hdl.handle.net/10356/65309
_version_ 1759854644202307584