WORKER BALANCING IMPLEMENTATION ON DRAGON SCHEDULER FOR DISTRIBUTED DEEP LEARNING IN KUBERNETES
Saved in:
Main Author:
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/65753
Institution: Institut Teknologi Bandung
Summary: Deep learning generally requires far more computation than conventional machine
learning, so its training process takes a long time. Distributed deep learning is an
alternative approach that reduces training time by distributing the computational load across
multiple machines. The DRAGON scheduler schedules distributed training jobs that use the
parameter server architecture with TensorFlow on a Kubernetes cluster.
The DRAGON scheduler has the advantage of being able to scale the number of workers of a
training job up or down depending on the availability of resources in the cluster. In the
existing scaling implementation of the DRAGON scheduler, workers are added to or removed from
one job first before other jobs are considered. This implementation was found to be
inefficient in terms of training duration because of a limitation of the parameter server
architecture. Due to this limitation, the scaling process in the DRAGON scheduler needs to be
modified by implementing worker balancing, which is done in this Final Project. With the
worker-balancing modification of the DRAGON scheduler, the training duration is reduced by
16.305% while maintaining the prediction accuracy of the training results.
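The following is a minimal sketch, in Go, of the worker-balancing idea described in the abstract: instead of giving all free worker slots to one job before moving on to the next, free capacity is spread across pending jobs. It is not taken from the DRAGON or Kubernetes source; the Job type, its fields (MinWorkers, MaxWorkers, Assigned), and the balanceWorkers function are hypothetical names introduced only for illustration, under the assumption that each job declares a minimum and maximum useful worker count.

```go
package main

import "fmt"

// Job is a hypothetical representation of a pending training job;
// the field names are illustrative, not taken from the DRAGON source.
type Job struct {
	Name       string
	MinWorkers int // minimum workers the job needs to start
	MaxWorkers int // upper bound on useful workers
	Assigned   int // workers assigned so far
}

// balanceWorkers distributes free worker slots round-robin across jobs
// instead of saturating one job first. This is only a sketch of the
// worker-balancing idea, under the assumptions stated above.
func balanceWorkers(jobs []*Job, freeSlots int) {
	// First satisfy every job's minimum so all jobs can run.
	for _, j := range jobs {
		need := j.MinWorkers - j.Assigned
		if need > 0 && freeSlots >= need {
			j.Assigned += need
			freeSlots -= need
		}
	}
	// Then hand out remaining slots one at a time, round-robin,
	// so no single job monopolizes the extra capacity.
	for freeSlots > 0 {
		progressed := false
		for _, j := range jobs {
			if freeSlots == 0 {
				break
			}
			if j.Assigned < j.MaxWorkers {
				j.Assigned++
				freeSlots--
				progressed = true
			}
		}
		if !progressed {
			break // all jobs are already at their maximum
		}
	}
}

func main() {
	jobs := []*Job{
		{Name: "job-a", MinWorkers: 1, MaxWorkers: 4},
		{Name: "job-b", MinWorkers: 1, MaxWorkers: 4},
	}
	balanceWorkers(jobs, 5)
	for _, j := range jobs {
		fmt.Printf("%s: %d workers\n", j.Name, j.Assigned)
	}
}
```

Running this sketch assigns three workers to job-a and two to job-b, rather than handing all five free slots to one job, which is the contrast with the one-job-first scaling behaviour the abstract describes.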