DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES

Distributed deep learning is a machine learning method that is widely used today because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the deployment of applications, which in turn makes distributed deep learning training in Kubeflow easier and more manageable. Previous work on dynamic resource scheduling for deep learning training in Kubernetes includes DRAGON, which proposed a scheduler with autoscaling and gang scheduling capabilities, and OASIS, which proposed a utility system with a price function. In this work, we propose to combine DRAGON's and OASIS's approaches into a scheduler with weighted autoscaling capabilities that schedules its jobs with gang scheduling. Some modifications are made to DRAGON's autoscaling function: we increase the frequency of scale-up calls and reduce the frequency of scale-down calls to make the training process more efficient. Weights are used to determine the priority of each job, with jobs that have higher resource requirements considered more important; the weight of each job influences the scheduler's autoscaling function. Experiments and evaluation using a set of TensorFlow jobs show an increase in training speed of over 26% compared with the default Kubernetes scheduler.
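To make the weighted-priority idea in the abstract concrete, the sketch below ranks pending training jobs for scale-up by a weight derived from their resource requests, so jobs with larger requests are considered first. This is only an illustration under assumed names and coefficients (Job, job_weight, gpu_factor, mem_factor); it is not the scheduler implemented in the thesis.

from dataclasses import dataclass
from typing import List

@dataclass
class Job:
    """A pending distributed training job and its per-replica resource request."""
    name: str
    replicas: int
    cpu_per_replica: float   # CPU cores requested per worker
    gpu_per_replica: int     # GPUs requested per worker
    mem_per_replica: float   # memory (GiB) requested per worker

def job_weight(job: Job, gpu_factor: float = 10.0, mem_factor: float = 0.5) -> float:
    """Weight grows with the job's total resource request, so jobs that ask for
    more resources are treated as higher priority (coefficients are illustrative)."""
    per_replica = (job.cpu_per_replica
                   + gpu_factor * job.gpu_per_replica
                   + mem_factor * job.mem_per_replica)
    return job.replicas * per_replica

def scale_up_order(pending: List[Job]) -> List[Job]:
    """Heavier jobs come first when the scheduler looks for jobs to scale up."""
    return sorted(pending, key=job_weight, reverse=True)

if __name__ == "__main__":
    jobs = [
        Job("mnist-cnn", replicas=2, cpu_per_replica=2, gpu_per_replica=0, mem_per_replica=4),
        Job("resnet50",  replicas=8, cpu_per_replica=4, gpu_per_replica=1, mem_per_replica=16),
    ]
    for j in scale_up_order(jobs):
        print(f"{j.name}: weight={job_weight(j):.1f}")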


Bibliographic Details
Main Author: Fadhriga Bestari, Muhammad
Format: Final Project
Language: Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/48067
Institution: Institut Teknologi Bandung
Language: Indonesia
id id-itb.:48067
spelling id-itb.:48067 2020-06-26T00:20:50Z DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES Fadhriga Bestari, Muhammad Indonesia Final Project Kubernetes, Kubeflow, deep learning job, job scheduling, scale up, scale down, gang scheduling, weighted job INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/48067 Distributed deep learning is a machine learning method that is widely used today because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the deployment of applications, which in turn makes distributed deep learning training in Kubeflow easier and more manageable. Previous work on dynamic resource scheduling for deep learning training in Kubernetes includes DRAGON, which proposed a scheduler with autoscaling and gang scheduling capabilities, and OASIS, which proposed a utility system with a price function. In this work, we propose to combine DRAGON's and OASIS's approaches into a scheduler with weighted autoscaling capabilities that schedules its jobs with gang scheduling. Some modifications are made to DRAGON's autoscaling function: we increase the frequency of scale-up calls and reduce the frequency of scale-down calls to make the training process more efficient. Weights are used to determine the priority of each job, with jobs that have higher resource requirements considered more important; the weight of each job influences the scheduler's autoscaling function. Experiments and evaluation using a set of TensorFlow jobs show an increase in training speed of over 26% compared with the default Kubernetes scheduler. text
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Distributed deep learning is a machine learning method that is widely used today because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the deployment of applications, which in turn makes distributed deep learning training in Kubeflow easier and more manageable. Previous work on dynamic resource scheduling for deep learning training in Kubernetes includes DRAGON, which proposed a scheduler with autoscaling and gang scheduling capabilities, and OASIS, which proposed a utility system with a price function. In this work, we propose to combine DRAGON's and OASIS's approaches into a scheduler with weighted autoscaling capabilities that schedules its jobs with gang scheduling. Some modifications are made to DRAGON's autoscaling function: we increase the frequency of scale-up calls and reduce the frequency of scale-down calls to make the training process more efficient. Weights are used to determine the priority of each job, with jobs that have higher resource requirements considered more important; the weight of each job influences the scheduler's autoscaling function. Experiments and evaluation using a set of TensorFlow jobs show an increase in training speed of over 26% compared with the default Kubernetes scheduler.
format Final Project
author Fadhriga Bestari, Muhammad
spellingShingle Fadhriga Bestari, Muhammad
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
author_facet Fadhriga Bestari, Muhammad
author_sort Fadhriga Bestari, Muhammad
title DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
title_short DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
title_full DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
title_fullStr DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
title_full_unstemmed DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
title_sort dynamic resource scheduler for distributed deep learning training in kubernetes
url https://digilib.itb.ac.id/gdl/view/48067
_version_ 1822271619631939584
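The abstract above also relies on gang scheduling, i.e. a distributed job's replicas are started together or not at all. The sketch below shows that all-or-nothing admission under assumed names (Cluster, TrainingJob, try_gang_schedule) and a deliberately simplified single-pool resource model; it is not the scheduler described in the thesis.

from dataclasses import dataclass
from typing import List

@dataclass
class Cluster:
    """Free capacity of the (simplified) cluster as a single resource pool."""
    free_cpu: float
    free_gpu: int

@dataclass
class TrainingJob:
    name: str
    replicas: int
    cpu_per_replica: float
    gpu_per_replica: int

def try_gang_schedule(job: TrainingJob, cluster: Cluster) -> bool:
    """Admit the job only if *all* replicas fit at once; otherwise admit none.

    This is the all-or-nothing property of gang scheduling: starting only a
    subset of a distributed training job's workers would leave it unable to
    make progress while still holding resources."""
    need_cpu = job.replicas * job.cpu_per_replica
    need_gpu = job.replicas * job.gpu_per_replica
    if need_cpu <= cluster.free_cpu and need_gpu <= cluster.free_gpu:
        cluster.free_cpu -= need_cpu
        cluster.free_gpu -= need_gpu
        return True
    return False

if __name__ == "__main__":
    cluster = Cluster(free_cpu=16, free_gpu=2)
    queue: List[TrainingJob] = [
        TrainingJob("bert-base", replicas=4, cpu_per_replica=4, gpu_per_replica=1),
        TrainingJob("lstm-small", replicas=2, cpu_per_replica=2, gpu_per_replica=0),
    ]
    for job in queue:
        placed = try_gang_schedule(job, cluster)
        print(f"{job.name}: {'scheduled' if placed else 'waiting (whole gang does not fit)'}")

In this toy run, bert-base is held back because its whole gang (4 replicas needing 4 GPUs) does not fit in the 2 free GPUs, while lstm-small is admitted in full.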