THE EFFECT OF PERFORMANCE METRICS MODIFICATION ON HYBRID RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING

Bibliographic Details
Main Author: Aptanagi, Pandyaka
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/55924
Institution: Institut Teknologi Bandung
Description
Summary: Distributed deep learning is a machine learning method used for complex and time-consuming feature extraction. AdaptDL is one framework for distributed machine learning; it runs machine learning processes on a Kubernetes cluster using the Pollux scheduler. When making scheduling decisions, Pollux offers only one performance metric, Goodput, with no alternatives. Pollux also has the potential to maximize training speed by changing the Goodput value, and to use resources more efficiently by changing the threshold value that determines the scaling mechanism. In this research, AdaptDL was extended with selectable performance metrics, a metric to maximize speed, and a metric to increase resource efficiency. The performance metric options were implemented in the AdaptDL framework, the speed metric was implemented by modifying the Goodput equation, and the efficiency metric was implemented by modifying Pollux. In tests using an image classification model on the MNIST dataset, these changes did not affect the accuracy of the resulting model but did affect other aspects of performance. Adding the performance metric options did not affect overall training performance. The modified speed metric slowed training by 16.099%. The modified resource-efficiency metric slowed training by 106.977%, increased resource generation time by 80%, and increased resource usage by 19.31%.
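
The idea of a selectable scheduling metric can be illustrated with a minimal Python sketch. It assumes the usual Pollux definition of Goodput as throughput multiplied by statistical efficiency. The names `JobStats`, `METRICS`, and `best_allocation`, and the alternative `throughput` metric, are hypothetical and are not AdaptDL options or the modified equations from this work, which the summary does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class JobStats:
    """Hypothetical per-job measurements for one candidate allocation."""
    throughput: float              # training examples processed per second
    statistical_efficiency: float  # relative training progress per example


def goodput(stats: JobStats) -> float:
    # Goodput: throughput weighted by statistical efficiency.
    return stats.throughput * stats.statistical_efficiency


def raw_throughput(stats: JobStats) -> float:
    # Illustrative alternative metric that ignores statistical efficiency.
    return stats.throughput


# Hypothetical registry of selectable metrics (assumed names, not AdaptDL's).
METRICS: Dict[str, Callable[[JobStats], float]] = {
    "goodput": goodput,
    "throughput": raw_throughput,
}


def best_allocation(candidates: Dict[str, JobStats], metric: str) -> str:
    """Return the name of the candidate that maximizes the chosen metric."""
    score = METRICS[metric]
    return max(candidates, key=lambda name: score(candidates[name]))
```

With two made-up candidates, one at 1000 examples/s and efficiency 0.9 and one at 1800 examples/s and efficiency 0.45, the Goodput metric prefers the first and the raw throughput metric prefers the second, which shows how the choice of metric alone can change a scheduling decision.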