Building efficient and practical machine learning systems
Main Author: | Hu, Qinghao |
---|---|
Other Authors: | Wen Yonggang; Zhang Tianwei |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Engineering::Computer science and engineering::Computer systems organization::Computer system implementation |
Online Access: | https://hdl.handle.net/10356/172372 |
Institution: | Nanyang Technological University |
Description:
With the widespread adoption of deep learning (DL) applications in recent years, training DL models has become increasingly prevalent. However, training these models is typically time-consuming and computation-intensive, relying heavily on expensive heterogeneous infrastructure. To facilitate model development, numerous research institutes, technology companies and cloud providers have invested substantially in building large-scale GPU clusters. Regrettably, these clusters are frequently underutilized, primarily due to (1) inefficiency: schedulers are agnostic to the unique characteristics of DL training jobs; and (2) impracticality: preemptive scheduling mechanisms are arduous to deploy in production. This thesis presents a suite of techniques to tackle the challenges of resource management and job scheduling at the datacenter level. We further extend our research to broader machine learning system scenarios, aiming to develop systems that are both highly efficient and practical in real-world applications.
In the first part of this thesis, we focus on developing tailored systems to improve the efficiency of DL job execution in datacenters. Achieving this goal requires a thorough understanding of job features and user behaviors, yet prior research offers only limited analysis of DL workloads. We therefore conduct a comprehensive investigation of DL job characteristics and resource management in a SenseTime datacenter containing over 6,000 GPUs. The study yields several insights from the perspectives of clusters, jobs and users, which motivate us to manage resources based on historical data so as to minimize job queuing delay and energy consumption. Beyond optimizing the scheduling of general training jobs, we observe that hyperparameter tuning jobs are pervasive and consume substantial resources within GPU clusters. We therefore build a holistic system that automatically applies novel hyperparameter transfer theory together with multiple system techniques to jointly improve tuning efficiency. At the job level, the system automatically scales models and accelerates tuning through inter-trial fusion, which combines multiple smaller models into a single unified entity. At the cluster level, it interacts with the scheduler to dynamically allocate resources and execute trials. In particular, it expands the available tuning resources by interleaving tuning trials with pipeline-parallel large-model training tasks, exploiting the idle intervals on each node, referred to as bubbles. Our experiments on the GPT-3 XL model demonstrate a significant acceleration of the hyperparameter tuning process, achieving a 78.5x makespan reduction.
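To make the bubble-interleaving idea concrete, here is a minimal Python sketch, not taken from the thesis: it illustrates the general notion of greedily packing tuning-trial steps into the idle intervals of a pipeline-parallel schedule. The greedy policy, the fixed step budget, and all names and durations are illustrative assumptions.

```python
# Hedged sketch (not the thesis system): pack short hyperparameter-tuning
# trial steps into the idle intervals ("bubbles") of a pipeline-parallel
# training schedule. All policies and numbers here are assumptions.

from dataclasses import dataclass

@dataclass
class Bubble:
    node: int       # node whose pipeline stage sits idle
    start: float    # idle interval start (seconds into the schedule)
    length: float   # idle interval duration

@dataclass
class Trial:
    trial_id: int
    est_step_time: float  # estimated time per tuning step on this node

def pack_trials_into_bubbles(bubbles, trials):
    """Greedily assign tuning-trial steps to pipeline bubbles.

    Each bubble absorbs as many steps of the current trial as fit in its
    idle window; leftover steps wait for the next bubble. Returns a list
    of (node, start, trial_id, n_steps) placements.
    """
    placements = []
    pending = list(trials)
    for b in sorted(bubbles, key=lambda b: (b.node, b.start)):
        if not pending:
            break
        t = pending[0]
        n_steps = int(b.length // t.est_step_time)
        if n_steps == 0:
            continue  # bubble too short for even one step of this trial
        placements.append((b.node, b.start, t.trial_id, n_steps))
        # Assume a fixed trial budget of 100 steps for illustration.
        t.remaining = getattr(t, "remaining", 100) - n_steps
        if t.remaining <= 0:
            pending.pop(0)
    return placements

if __name__ == "__main__":
    bubbles = [Bubble(node=0, start=1.0, length=0.8),
               Bubble(node=0, start=3.0, length=1.2),
               Bubble(node=1, start=2.5, length=0.5)]
    trials = [Trial(trial_id=7, est_step_time=0.05)]
    for p in pack_trials_into_bubbles(bubbles, trials):
        print("node %d @ %.1fs: trial %d runs %d steps" % p)
```

The point of the sketch is only that bubble placement reduces to an interval-packing problem once per-step trial times are estimated; the thesis system additionally coordinates with the cluster scheduler, which is not modeled here.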
In the second part of this thesis, we address the challenges that hinder the practical deployment of research prototypes, both for datacenter schedulers and for machine learning systems more broadly. Although recent DL-tailored schedulers have shown an impressive ability to improve job efficiency, deploying them in practice is hampered by substantial defects, including their inflexible, intrusive designs, exorbitant integration costs and limited scalability. To bridge these gaps, we design a non-intrusive and transparent scheduler that outperforms preemptive, intrusive schedulers. It uses a prediction-based packing strategy to circumvent interference and orchestrates resources according to estimated job priority values to achieve efficient scheduling. Additionally, in more general system-related research domains, machine learning techniques have been widely adopted to enhance system performance; however, we identify similar gaps in practical deployment, such as opaque decision processes, poor generalization and robustness, and exorbitant training and inference overhead. We develop a unified framework that resolves these challenges and facilitates transparent, accurate and lightweight systems built on interpretable models. The framework optimizes both the training and post-processing stages of constructing learning-augmented systems. In evaluations on cutting-edge systems in the storage and networking domains, it provides clear model interpretations, lower deployment costs and better system performance.
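The two ingredients named above, priority estimation and prediction-based packing, can be illustrated with a minimal Python sketch. This is not the thesis scheduler: the interference predictor, the shortest-remaining-work priority rule, and all job attributes are stand-in assumptions.

```python
# Hedged sketch (not the thesis scheduler): (1) derive a job priority from
# its predicted remaining runtime, and (2) pack a job next to a resident
# job only when a predictor says co-location will not interfere.
# The predictor and thresholds below are stand-in assumptions.

import heapq

def predicted_interference(job_a, job_b):
    # Stand-in interference model: co-locating two bandwidth-hungry jobs
    # is predicted to slow both down; a real system would learn this.
    return job_a["mem_bw"] + job_b["mem_bw"] > 1.0

def schedule(jobs, n_gpus):
    """One non-preemptive scheduling pass over pending jobs.

    Priority = predicted remaining runtime (smaller runs first). A job is
    packed onto a GPU with a resident job only if the interference
    predictor allows it; otherwise it takes a free GPU or waits.
    """
    heap = [(j["predicted_remaining"], j["name"], j) for j in jobs]
    heapq.heapify(heap)
    gpus = [[] for _ in range(n_gpus)]  # resident jobs per GPU
    waiting = []
    while heap:
        _, _, job = heapq.heappop(heap)
        # Prefer an empty GPU; otherwise try interference-safe packing.
        target = next((g for g in gpus if not g), None)
        if target is None:
            target = next((g for g in gpus
                           if len(g) == 1
                           and not predicted_interference(g[0], job)), None)
        (target.append(job) if target is not None else waiting.append(job))
    return gpus, waiting

if __name__ == "__main__":
    jobs = [{"name": "resnet", "predicted_remaining": 120, "mem_bw": 0.7},
            {"name": "bert",   "predicted_remaining":  45, "mem_bw": 0.6},
            {"name": "gpt",    "predicted_remaining": 300, "mem_bw": 0.3}]
    gpus, waiting = schedule(jobs, n_gpus=2)
    print("placements:", [[j["name"] for j in g] for g in gpus])
    print("waiting:", [j["name"] for j in waiting])
```

Because placement decisions are driven only by external predictions (remaining time, interference), this style of scheduler needs no hooks inside the training framework, which is the sense in which the thesis scheduler is non-intrusive.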
DOI: | 10.32657/10356/172372 |
---|---|
Citation: | Hu, Q. (2023). Building efficient and practical machine learning systems. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172372 |
Degree: | Doctor of Philosophy |
School: | School of Computer Science and Engineering |
Supervisors: | Wen Yonggang (YGWEN@ntu.edu.sg); Zhang Tianwei (tianwei.zhang@ntu.edu.sg) |
Funding: | RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) |
License: | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). |