DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
Distributed deep learning is a widely used machine learning method because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the deploymen...
Saved in:
Main Author: | Fadhriga Bestari, Muhammad |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/48067 |
Institution: | Institut Teknologi Bandung |
Language: | Indonesia |
id |
id-itb.:48067 |
---|---|
spelling |
id-itb.:48067 2020-06-26T00:20:50Z DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES Fadhriga Bestari, Muhammad Indonesia Final Project Kubernetes, Kubeflow, deep learning job, job scheduling, scale up, scale down, gang scheduling, weighted job INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/48067 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Distributed deep learning is a widely used machine learning method because of its many
advantages. One of the many tools used to train distributed deep learning models is Kubeflow,
which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the
deployment of applications, which in turn makes distributed deep learning training with
Kubeflow easier and more manageable. Work on dynamic resource scheduling in Kubernetes
for deep learning training has been done before: DRAGON proposed a scheduler with
autoscaling and gang scheduling capabilities, and OASIS proposed a utility system with a
price function. In this work, we combine DRAGON's and OASIS's approaches to build a
scheduler with weighted autoscaling that places its jobs with gang scheduling. We modify
DRAGON's autoscaling function by increasing the frequency of scale-up calls and reducing
the frequency of scale-down calls, making the training process more efficient. Weights are
used to determine the priority of each job, with jobs that have higher resource requirements
considered more important; the weight of each job influences the scheduler's autoscaling
decisions. Experiments and evaluation on a set of TensorFlow jobs show an increase in
training speed of over 26% compared with the default Kubernetes scheduler.
|
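The approach summarized in the description rests on two mechanisms: each job gets a weight derived from its resource requirements (heavier jobs are considered more important), and a job is only placed when its whole gang of replicas fits at once. The Go sketch below illustrates that combination under simple assumptions; the `Job` type, its fields, and the CPU-only capacity model are hypothetical and are not taken from the thesis or from the DRAGON/OASIS implementations.

```go
// Minimal sketch (not the thesis implementation): jobs are weighted by their
// total resource request, higher-weight jobs are served first, and a job is
// only admitted when its whole gang of replicas fits in the free capacity.
package main

import (
	"fmt"
	"sort"
)

// Job models a distributed TensorFlow training job; all fields are illustrative.
type Job struct {
	Name     string
	CPUs     float64 // CPUs requested per worker replica
	Replicas int     // number of worker replicas that must start together
}

// Weight treats jobs with higher total resource requirements as more important,
// as described in the abstract.
func (j Job) Weight() float64 {
	return j.CPUs * float64(j.Replicas)
}

// gangFits returns true only if every replica of the job can be placed at once
// (gang scheduling); partial placement is never allowed.
func gangFits(j Job, freeCPUs float64) bool {
	return j.Weight() <= freeCPUs
}

func main() {
	freeCPUs := 24.0
	jobs := []Job{
		{Name: "cnn-small", CPUs: 2, Replicas: 2},
		{Name: "bert-large", CPUs: 4, Replicas: 8},
		{Name: "resnet", CPUs: 2, Replicas: 6},
	}

	// Consider higher-weight jobs first.
	sort.Slice(jobs, func(a, b int) bool { return jobs[a].Weight() > jobs[b].Weight() })

	for _, j := range jobs {
		if gangFits(j, freeCPUs) {
			freeCPUs -= j.Weight()
			fmt.Printf("admit %-10s weight=%4.1f remaining=%4.1f\n", j.Name, j.Weight(), freeCPUs)
		} else {
			fmt.Printf("queue %-10s weight=%4.1f (gang does not fit)\n", j.Name, j.Weight())
		}
	}
}
```

The thesis additionally tunes how often scale-up and scale-down are attempted (favoring scale-up); that asymmetry is omitted here to keep the sketch minimal.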
format |
Final Project |
author |
Fadhriga Bestari, Muhammad |
spellingShingle |
Fadhriga Bestari, Muhammad DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
author_facet |
Fadhriga Bestari, Muhammad |
author_sort |
Fadhriga Bestari, Muhammad |
title |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_short |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_full |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_fullStr |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_full_unstemmed |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_sort |
dynamic resource scheduler for distributed deep learning training in kubernetes |
url |
https://digilib.itb.ac.id/gdl/view/48067 |
_version_ |
1822271619631939584 |