WORKER BALANCING IMPLEMENTATION ON DRAGON SCHEDULER FOR DISTRIBUTED DEEP LEARNING IN KUBERNETES


Bibliographic Details
Main Author: Prima Yoriko, Naufal
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/65753
Institution: Institut Teknologi Bandung
Description
Summary: Deep learning generally involves far more computation than conventional machine learning, so its training process takes a long time. Distributed deep learning is an alternative approach that reduces training time by distributing the computational load across multiple machines. The DRAGON scheduler schedules distributed training jobs that use the parameter server architecture with TensorFlow on a Kubernetes cluster. Its main advantage is the ability to scale the number of workers of a training job up or down according to the availability of resources in the cluster. In the original implementation of scaling in the DRAGON scheduler, the process of adding and removing workers is concentrated on one job at a time. This implementation, however, turned out to be inefficient in terms of training duration because of a limitation of the parameter server architecture. Due to this limitation, the scaling process in the DRAGON scheduler needs to be modified by implementing worker balancing, which is what this Final Project does. With the DRAGON scheduler modified to use worker balancing, training duration can be reduced by 16.305% while the prediction accuracy of the training results is maintained.
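
The sketch below illustrates the worker-balancing idea described in the summary: instead of assigning all newly available worker slots to a single training job first, the slots are spread as evenly as possible across all active jobs. This is a minimal illustration written for this record, not code from the thesis or from the DRAGON scheduler source; the function name, the job identifiers, and the round-robin split are assumptions.

package main

import "fmt"

// balanceWorkers distributes the available worker slots as evenly as
// possible across all active training jobs, rather than giving them
// all to one job first. This mirrors the worker-balancing idea only;
// the real DRAGON scheduler also has to account for per-job resource
// requests and Kubernetes node capacity, which are omitted here.
func balanceWorkers(jobs []string, availableSlots int) map[string]int {
	allocation := make(map[string]int, len(jobs))
	if len(jobs) == 0 || availableSlots <= 0 {
		return allocation
	}
	// Every job gets the same base share; the remainder is handed
	// out one slot at a time in job order (round-robin).
	base := availableSlots / len(jobs)
	remainder := availableSlots % len(jobs)
	for i, job := range jobs {
		allocation[job] = base
		if i < remainder {
			allocation[job]++
		}
	}
	return allocation
}

func main() {
	// Hypothetical jobs and slot count for illustration.
	jobs := []string{"job-a", "job-b", "job-c"}
	fmt.Println(balanceWorkers(jobs, 7)) // map[job-a:3 job-b:2 job-c:2]
}

In this toy example, seven free worker slots are split 3/2/2 across three jobs, whereas the unmodified scaling strategy described above would grow one job as far as possible before touching the others.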