DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES
Distributed deep learning is a widely used machine learning method because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the deploymen...
Saved in:
Main Author: | Fadhriga Bestari, Muhammad |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/48067 |
Institution: | Institut Teknologi Bandung |
Language: | Indonesia |
id |
id-itb.:48067 |
---|---|
spelling |
id-itb.:48067 2020-06-26T00:20:50Z DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES Fadhriga Bestari, Muhammad Indonesia Final Project Kubernetes, Kubeflow, deep learning job, job scheduling, scale up, scale down, gang scheduling, weighted job INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/48067 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Distributed deep learning is a widely used machine learning method because of its many
advantages. One of the many tools used to train distributed deep learning models is Kubeflow,
which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the
deployment of applications, which in turn makes distributed deep learning training with
Kubeflow easier and more manageable. Work on dynamic resource scheduling in Kubernetes
for deep learning training has been done before: DRAGON proposed a scheduler with
autoscaling and gang scheduling capabilities, and OASIS proposed a utility system with a
price function. In this work, we combine DRAGON's and OASIS's approaches to build a
scheduler with weighted autoscaling that places its jobs with gang scheduling. We modify
DRAGON's autoscaling function by increasing the frequency of scale-up calls and reducing
the frequency of scale-down calls, making the training process more efficient. Weights are
used to determine the priority of each job, with jobs that have higher resource requirements
considered more important; the weight of each job influences the scheduler's autoscaling
decisions. Experiments and evaluation on a set of TensorFlow jobs show an increase in
training speed of over 26% compared with the default Kubernetes scheduler.
|
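The approach summarized in the description rests on two mechanisms: each job gets a weight derived from its resource requirements (heavier jobs are considered more important), and a job is only placed when its whole gang of replicas fits at once. The Go sketch below illustrates that combination under simple assumptions; the `Job` type, its fields, and the CPU-only capacity model are hypothetical and are not taken from the thesis or from the DRAGON/OASIS implementations.

```go
// Minimal sketch (not the thesis implementation): jobs are weighted by their
// total resource request, higher-weight jobs are served first, and a job is
// only admitted when its whole gang of replicas fits in the free capacity.
package main

import (
	"fmt"
	"sort"
)

// Job models a distributed TensorFlow training job; all fields are illustrative.
type Job struct {
	Name     string
	CPUs     float64 // CPUs requested per worker replica
	Replicas int     // number of worker replicas that must start together
}

// Weight treats jobs with higher total resource requirements as more important,
// as described in the abstract.
func (j Job) Weight() float64 {
	return j.CPUs * float64(j.Replicas)
}

// gangFits returns true only if every replica of the job can be placed at once
// (gang scheduling); partial placement is never allowed.
func gangFits(j Job, freeCPUs float64) bool {
	return j.Weight() <= freeCPUs
}

func main() {
	freeCPUs := 24.0
	jobs := []Job{
		{Name: "cnn-small", CPUs: 2, Replicas: 2},
		{Name: "bert-large", CPUs: 4, Replicas: 8},
		{Name: "resnet", CPUs: 2, Replicas: 6},
	}

	// Consider higher-weight jobs first.
	sort.Slice(jobs, func(a, b int) bool { return jobs[a].Weight() > jobs[b].Weight() })

	for _, j := range jobs {
		if gangFits(j, freeCPUs) {
			freeCPUs -= j.Weight()
			fmt.Printf("admit %-10s weight=%4.1f remaining=%4.1f\n", j.Name, j.Weight(), freeCPUs)
		} else {
			fmt.Printf("queue %-10s weight=%4.1f (gang does not fit)\n", j.Name, j.Weight())
		}
	}
}
```

The thesis additionally tunes how often scale-up and scale-down are attempted (favoring scale-up); that asymmetry is omitted here to keep the sketch minimal.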
format |
Final Project |
author |
Fadhriga Bestari, Muhammad |
spellingShingle |
Fadhriga Bestari, Muhammad DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
author_facet |
Fadhriga Bestari, Muhammad |
author_sort |
Fadhriga Bestari, Muhammad |
title |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_short |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_full |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_fullStr |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_full_unstemmed |
DYNAMIC RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING IN KUBERNETES |
title_sort |
dynamic resource scheduler for distributed deep learning training in kubernetes |
url |
https://digilib.itb.ac.id/gdl/view/48067 |
_version_ |
1822271619631939584 |