DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER

Distributed deep learning training today typically uses static resource allocation. In the parameter server architecture, training is carried out by several parameter server (ps) nodes and worker nodes whose numbers remain constant while the training runs, hence "static". Consider a training job running on a general-purpose cluster alongside other processes. Midway through the run, some of those processes may finish, freeing resources on the cluster. A training job with a static configuration cannot use those freed resources to speed itself up, and if the freed resources stay idle for a long time, the cluster becomes underutilized. In this final project, dynamic resource allocation is designed and implemented for TensorFlow training jobs on a Kubernetes cluster. The implementation adds a component called ConfigurationManager (CM), which tracks the cluster's resource state and adds ps and worker nodes to the training job whenever free resources appear. The training job is designed to communicate periodically with the CM to report its progress. Experiments show that training with dynamic resource allocation outperforms static allocation on resource usage, epoch time, and total training time, although it reaches slightly lower accuracy than the static configuration.
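The "static" configuration the abstract criticizes can be made concrete. In TensorFlow's parameter server architecture a job starts from a fixed cluster specification; a minimal sketch (host names are placeholders, not taken from the thesis):

    import tensorflow as tf

    # Static configuration: the ps and worker counts below are fixed for
    # the entire run, which is exactly the limitation the project targets.
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps-0:2222", "ps-1:2222"],
        "worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"],
    })

The record does not include the ConfigurationManager's code, so the following is only a rough sketch, using the official Kubernetes Python client, of the kind of loop such a component could run: poll the cluster for spare CPU and grow a worker Deployment when enough is free. All names (namespace, Deployment name, thresholds) are illustrative assumptions, and a real CM would additionally have to rebuild the TensorFlow cluster specification and handle the periodic progress reports the abstract mentions.

    import time
    from kubernetes import client, config

    NAMESPACE = "training"           # assumed namespace of the training job
    WORKER_DEPLOYMENT = "tf-worker"  # assumed Deployment running the workers
    CPU_PER_WORKER = 1.0             # cores each additional worker would request
    POLL_SECONDS = 30

    def parse_cpu(quantity):
        # Kubernetes CPU quantities: "2" means 2 cores, "500m" means 0.5 cores.
        return float(quantity[:-1]) / 1000.0 if quantity.endswith("m") else float(quantity)

    def free_cpu(core_api):
        # Allocatable CPU across all nodes minus CPU requested by running pods.
        allocatable = sum(parse_cpu(n.status.allocatable["cpu"])
                          for n in core_api.list_node().items)
        requested = 0.0
        for pod in core_api.list_pod_for_all_namespaces(
                field_selector="status.phase=Running").items:
            for c in pod.spec.containers:
                req = ((c.resources and c.resources.requests) or {}).get("cpu")
                if req:
                    requested += parse_cpu(req)
        return allocatable - requested

    def main():
        config.load_kube_config()  # use config.load_incluster_config() inside the cluster
        core = client.CoreV1Api()
        apps = client.AppsV1Api()
        while True:
            if free_cpu(core) >= CPU_PER_WORKER:
                scale = apps.read_namespaced_deployment_scale(WORKER_DEPLOYMENT, NAMESPACE)
                apps.patch_namespaced_deployment_scale(
                    WORKER_DEPLOYMENT, NAMESPACE,
                    body={"spec": {"replicas": scale.spec.replicas + 1}})
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()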


Bibliographic Details
Main Author: Yesa Surya, Rahmad
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/39082
Institution: Institut Teknologi Bandung
Subjects: TensorFlow, Kubernetes, distributed training, resource allocation