DYNAMIC RESOURCE ALLOCATION FOR DEEP LEARNING TRAINING USING TENSORFLOW ON KUBERNETES CLUSTER

Distributed deep learning training today typically uses static resource allocation. In the parameter server architecture, training is carried out by several parameter server (ps) nodes and worker nodes whose numbers remain constant while the training runs, hence "static". Consider a training job running on a general-purpose cluster alongside other processes. Midway through the run, some of those processes may finish, freeing resources on the cluster. A training job with a static configuration cannot use those freed resources to speed itself up, and if the freed resources stay idle for a long time, the cluster becomes underutilized. In this final project, dynamic resource allocation is designed and implemented for TensorFlow training jobs on a Kubernetes cluster. The implementation adds a component called ConfigurationManager (CM), which tracks the cluster's resource state and adds ps and worker nodes to the training job whenever free resources appear. The training job is designed to communicate periodically with the CM to report its progress. Experiments show that training with dynamic resource allocation outperforms static allocation on resource usage, epoch time, and total training time, although it reaches slightly lower accuracy than the static configuration.
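The "static" configuration the abstract criticizes can be made concrete. In TensorFlow's parameter server architecture a job starts from a fixed cluster specification; a minimal sketch (host names are placeholders, not taken from the thesis):

    import tensorflow as tf

    # Static configuration: the ps and worker counts below are fixed for
    # the entire run, which is exactly the limitation the project targets.
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps-0:2222", "ps-1:2222"],
        "worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"],
    })

The record does not include the ConfigurationManager's code, so the following is only a rough sketch, using the official Kubernetes Python client, of the kind of loop such a component could run: poll the cluster for spare CPU and grow a worker Deployment when enough is free. All names (namespace, Deployment name, thresholds) are illustrative assumptions, and a real CM would additionally have to rebuild the TensorFlow cluster specification and handle the periodic progress reports the abstract mentions.

    import time
    from kubernetes import client, config

    NAMESPACE = "training"           # assumed namespace of the training job
    WORKER_DEPLOYMENT = "tf-worker"  # assumed Deployment running the workers
    CPU_PER_WORKER = 1.0             # cores each additional worker would request
    POLL_SECONDS = 30

    def parse_cpu(quantity):
        # Kubernetes CPU quantities: "2" means 2 cores, "500m" means 0.5 cores.
        return float(quantity[:-1]) / 1000.0 if quantity.endswith("m") else float(quantity)

    def free_cpu(core_api):
        # Allocatable CPU across all nodes minus CPU requested by running pods.
        allocatable = sum(parse_cpu(n.status.allocatable["cpu"])
                          for n in core_api.list_node().items)
        requested = 0.0
        for pod in core_api.list_pod_for_all_namespaces(
                field_selector="status.phase=Running").items:
            for c in pod.spec.containers:
                req = ((c.resources and c.resources.requests) or {}).get("cpu")
                if req:
                    requested += parse_cpu(req)
        return allocatable - requested

    def main():
        config.load_kube_config()  # use config.load_incluster_config() inside the cluster
        core = client.CoreV1Api()
        apps = client.AppsV1Api()
        while True:
            if free_cpu(core) >= CPU_PER_WORKER:
                scale = apps.read_namespaced_deployment_scale(WORKER_DEPLOYMENT, NAMESPACE)
                apps.patch_namespaced_deployment_scale(
                    WORKER_DEPLOYMENT, NAMESPACE,
                    body={"spec": {"replicas": scale.spec.replicas + 1}})
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()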


Bibliographic Details
Main Author: Yesa Surya, Rahmad
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/39082
Institution: Institut Teknologi Bandung
Subjects: TensorFlow, Kubernetes, distributed training, resource allocation