THE EFFECT OF PERFORMANCE METRICS MODIFICATION ON HYBRID RESOURCE SCHEDULER FOR DISTRIBUTED DEEP LEARNING TRAINING
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/55924 |
Institution: | Institut Teknologi Bandung |
Summary: | Distributed deep learning is a machine learning method used for complex
and time-consuming feature extraction. One framework for distributed machine
learning is AdaptDL, which runs machine learning processes on top of a Kubernetes
cluster using the Pollux scheduler. When making scheduling decisions, Pollux
provides only one performance metric, Goodput, and offers no other options.
Pollux also has the potential to maximize training speed by changing the Goodput
value, and to use resources more efficiently by changing the threshold value that
determines the scaling mechanism. In this research, AdaptDL was extended with
selectable performance metrics, a metric to maximize speed, and a metric to
increase resource efficiency. The performance metric options were implemented in
the AdaptDL framework, the speed metric was implemented by modifying the Goodput
equation, and the efficiency metric was implemented by modifying Pollux. In tests
using an image recognition classification model on the MNIST dataset, these
changes did not affect the accuracy of the resulting model but did affect other
aspects of performance. Adding performance metric options did not affect overall
learning performance. The modified speed metric slowed training by 16.099%. The
modified resource efficiency metric slowed training by 106.977%, increased
resource generation time by 80%, and increased resource usage by 19.31%. |
---|
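
The summary says that Pollux makes scheduling decisions from a single metric, Goodput, and that this work makes the metric selectable. A minimal sketch of that idea follows. It assumes the published Pollux definition of Goodput as system throughput multiplied by statistical efficiency. The `Candidate` class, its fields, the `speed_only` metric, and the sample numbers are all hypothetical illustrations; they are not AdaptDL's API and not the modified equation used in this project, which the record does not give.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """One hypothetical resource allocation for a training job."""
    replicas: int       # number of worker replicas (illustrative field)
    throughput: float   # training examples processed per second
    efficiency: float   # statistical efficiency, in (0, 1]


def goodput(c: Candidate) -> float:
    # Goodput in the Pollux sense: throughput scaled by statistical efficiency.
    return c.throughput * c.efficiency


def speed_only(c: Candidate) -> float:
    # Hypothetical alternative metric: raw throughput, ignoring efficiency.
    return c.throughput


def best_allocation(
    candidates: Iterable[Candidate],
    metric: Callable[[Candidate], float],
) -> Candidate:
    # Pick the allocation that maximizes whichever metric was selected.
    return max(candidates, key=metric)


# Made-up numbers, chosen only so the two metrics disagree.
candidates = [
    Candidate(replicas=1, throughput=100.0, efficiency=1.0),
    Candidate(replicas=2, throughput=190.0, efficiency=0.9),
    Candidate(replicas=4, throughput=340.0, efficiency=0.5),
]

print(best_allocation(candidates, goodput).replicas)
print(best_allocation(candidates, speed_only).replicas)
```

With these sample values the Goodput metric prefers the two-replica allocation while the throughput-only metric prefers four replicas, which illustrates why changing the metric can change both training speed and resource usage.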