FC² : cloud-based cluster provisioning for distributed machine learning

Training large, complex machine learning models such as deep neural networks with big data requires powerful computing clusters, which are costly to acquire, use and maintain. As a result, many machine learning researchers turn to cloud computing services for on-demand and elastic resource provisioning capabilities. Two issues have arisen from this trend: (1) if not configured properly, training models on cloud-based clusters can incur significant cost and time, and (2) many machine learning researchers focus primarily on model and algorithm development, so they may not have the time or skills to deal with system setup, resource selection and configuration. In this work, we propose and implement FC²: a system for fast, convenient and cost-effective distributed machine learning over public cloud resources. Central to the effectiveness of FC² is its ability to recommend a resource configuration that is appropriate, in terms of cost and execution time, for a given model training task. Unlike previous work, our approach does not require manual analysis of the training task's code and dataset in advance. The recommended resource configuration can then be deployed and managed automatically by FC² until the training task completes. We have conducted extensive experiments with an implementation of FC², using real-world deep neural network models and datasets. The results demonstrate the effectiveness of our approach, which can produce cost savings of up to 80% while maintaining training performance similar to that of much more expensive resource configurations.
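
The abstract describes FC² as recommending a resource configuration that balances cost and execution time for a given training task; the paper itself (linked below) details the actual recommendation method. The following is only a minimal, hypothetical sketch of that general idea: choosing the cheapest candidate cluster configuration whose estimated training time fits a user-given deadline. The configuration names, prices and time estimates are invented for illustration and are not taken from the paper.

# Illustrative only: a toy selector over hypothetical cluster configurations.
# Candidate names, prices and time estimates are invented; FC²'s real
# recommender is described in the paper (doi:10.1007/s10586-019-02912-6).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    name: str               # e.g. instance type and node count (hypothetical)
    est_hours: float        # estimated training time for the given task
    price_per_hour: float   # total cluster price per hour (USD)

    @property
    def est_cost(self) -> float:
        # Total estimated cost of running the training task to completion.
        return self.est_hours * self.price_per_hour

def recommend(candidates: list[Config], deadline_hours: float) -> Optional[Config]:
    """Return the cheapest configuration whose estimated time meets the deadline."""
    feasible = [c for c in candidates if c.est_hours <= deadline_hours]
    return min(feasible, key=lambda c: c.est_cost, default=None)

if __name__ == "__main__":
    candidates = [
        Config("4 x small-gpu", est_hours=10.0, price_per_hour=4.0),  # $40 total
        Config("8 x small-gpu", est_hours=6.0, price_per_hour=8.0),   # $48 total
        Config("2 x large-gpu", est_hours=7.0, price_per_hour=6.0),   # $42 total
    ]
    best = recommend(candidates, deadline_hours=8.0)
    print(best.name if best else "no feasible configuration")

Trading off estimated cost against a time budget in this way is what, per the abstract, lets a recommender achieve large cost savings (up to 80% in the paper's experiments) relative to simply picking the most powerful, most expensive configuration.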

Bibliographic Details
Main Author: Ta, Nguyen Binh Duong
Other Authors: School of Computer Science and Engineering
Format: Journal Article
Language: English
Published: 2019
Subjects: Engineering::Computer science and engineering; Distributed Machine Learning; Cloud-based Clusters
Online Access: https://hdl.handle.net/10356/151787
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-151787 (DSpace)
Deposited in DR-NTU: 2021-07-16
Citation: Ta, N. B. D. (2019). FC² : cloud-based cluster provisioning for distributed machine learning. Cluster Computing, 22(4), 1299-1315. https://dx.doi.org/10.1007/s10586-019-02912-6
DOI: 10.1007/s10586-019-02912-6
ISSN: 1386-7857
ORCID: 0000-0002-2882-2837
Scopus EID: 2-s2.0-85061267724
Funding: Ministry of Education (MOE). The research has been supported via the Academic Research Fund (AcRF) Tier 1 Grant RG121/15.
Rights: © 2019 Springer Science+Business Media, LLC, part of Springer Nature. All rights reserved.
Content Provider: NTU Library
Collection: DR-NTU