DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER
Distributed deep learning training today typically uses static resource allocation. Using the parameter server architecture, deep learning training is carried out by several parameter server (ps) nodes and worker nodes. Their numbers are constant while the training is running, hence static. Consider a traini...
Main Author: | Yesa Surya, Rahmad |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/39082 |
Institution: | Institut Teknologi Bandung |
Language: | Indonesia |
id |
id-itb.:39082 |
---|---|
spelling |
id-itb.:390822019-06-21T15:11:27ZDYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER Yesa Surya, Rahmad Indonesia Final Project TensorFlow, Kubernetes, distributed training, resource allocation. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/39082 Distributed deep learning training today typically uses static resource allocation. Using the parameter server architecture, deep learning training is carried out by several parameter server (ps) nodes and worker nodes. Their numbers are constant while the training is running, hence static. Consider a training job running on a general-purpose cluster where other processes may be running alongside the training. In the middle of its run, some of those processes may end, freeing resources on the cluster. A training job with a static configuration therefore cannot use those free resources to speed up its progress. Moreover, if those free resources stay unused for a long time, the cluster's resources become underutilized. In this final project, dynamic resource allocation is designed and implemented for a TensorFlow training job in a Kubernetes cluster. The implementation is done by creating a component called ConfigurationManager (CM). The role of this component is to track the cluster's resource information and to add more ps and worker nodes to the training once free resources exist. The training is designed to communicate periodically with the CM to report its progress. The experiment shows that training with dynamic resource allocation outperforms training with static resource allocation on the following metrics: resource usage, epoch time, and total training time. However, training with dynamic resource allocation has slightly lower accuracy than training with static resource allocation. text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Distributed deep learning training today typically uses static resource allocation. Using
the parameter server architecture, deep learning training is carried out by several
parameter server (ps) nodes and worker nodes. Their numbers are constant while
the training is running, hence static. Consider a training job running on a general-purpose
cluster where other processes may be running alongside the training. In the middle
of its run, some of those processes may end, freeing resources on the cluster.
A training job with a static configuration therefore cannot use those free resources
to speed up its progress. Moreover, if those free resources stay unused for a long
time, the cluster's resources become underutilized. In this final project, dynamic
resource allocation is designed and implemented for a TensorFlow training job in a
Kubernetes cluster. The implementation is done by creating a component called
ConfigurationManager (CM). The role of this component is to track the cluster's
resource information and to add more ps and worker nodes to the training once free
resources exist. The training is designed to communicate periodically with the CM
to report its progress. The experiment shows that training with dynamic resource
allocation outperforms training with static resource allocation on the following
metrics: resource usage, epoch time, and total training time. However, training with
dynamic resource allocation has slightly lower accuracy than training with static
resource allocation. |
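The abstract describes the ConfigurationManager growing the training job (more ps and worker nodes) whenever free resources appear. A minimal Python sketch of one possible scaling heuristic: the function name, CPU thresholds, and the workers-per-ps ratio are illustrative assumptions, not details taken from the thesis.

```python
def plan_scale_up(spec, free_cpu_m, worker_cpu_m=1000, ps_cpu_m=500,
                  workers_per_ps=3):
    """Return an enlarged TF_CONFIG-style cluster spec.

    spec        -- {"ps": [...], "worker": [...]} lists of host:port strings
    free_cpu_m  -- unallocated CPU in the cluster, in millicores
    Greedily adds workers while free CPU remains, inserting an extra ps
    node whenever the worker/ps ratio reaches workers_per_ps.
    """
    # Copy so the caller's spec is left untouched.
    spec = {"ps": list(spec["ps"]), "worker": list(spec["worker"])}
    while free_cpu_m >= worker_cpu_m:
        if len(spec["worker"]) >= workers_per_ps * len(spec["ps"]):
            # Too many workers per ps: grow the ps tier first.
            spec["ps"].append(f"ps-{len(spec['ps'])}:2222")
            free_cpu_m -= ps_cpu_m
        else:
            spec["worker"].append(f"worker-{len(spec['worker'])}:2222")
            free_cpu_m -= worker_cpu_m
    return spec


# Example: starting from 1 ps and 1 worker with 2500 millicores free,
# the plan adds two more workers before running out of budget.
plan = plan_scale_up({"ps": ["ps-0:2222"], "worker": ["worker-0:2222"]},
                     free_cpu_m=2500)
```

In the actual system, the CM would apply such a plan by creating new pods via the Kubernetes API and distributing the updated cluster spec to the training processes at their next periodic check-in.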
format |
Final Project |
author |
Yesa Surya, Rahmad |
spellingShingle |
Yesa Surya, Rahmad DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER |
author_facet |
Yesa Surya, Rahmad |
author_sort |
Yesa Surya, Rahmad |
title |
DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER |
title_short |
DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER |
title_full |
DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER |
title_fullStr |
DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER |
title_full_unstemmed |
DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER |
title_sort |
dynamic resource allocation for deep learning training using tensorflow on kubernetes cluster |
url |
https://digilib.itb.ac.id/gdl/view/39082 |
_version_ |
1822269169699127296 |