Building efficient and practical machine learning systems
Main Author: | Hu, Qinghao |
---|---|
Other Authors: | Wen Yonggang; Zhang Tianwei |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Engineering::Computer science and engineering::Computer systems organization::Computer system implementation |
Online Access: | https://hdl.handle.net/10356/172372 |
Institution: | Nanyang Technological University |
Description:
With the widespread adoption of deep learning (DL) applications in recent years, training DL models has become increasingly prevalent. However, training these models is typically time-consuming and computation-intensive, relying heavily on expensive heterogeneous infrastructure. To facilitate model development, numerous research institutes, technology companies and cloud providers have invested substantially in building large-scale GPU clusters. Regrettably, these clusters are frequently underutilized, primarily due to (1) inefficiency: schedulers are agnostic to the unique characteristics of DL training jobs; and (2) impracticality: preemptive scheduling mechanisms are arduous to deploy in production. This thesis presents a suite of techniques to tackle the challenges of resource management and job scheduling at the datacenter level. We further extend our research to broader machine learning system scenarios, aiming to develop systems that are both highly efficient and practical in real-world applications.
In the first part of this thesis, we focus on developing tailored systems to improve the efficiency of DL job execution in datacenters. Achieving this goal requires a thorough understanding of job features and user behaviors, yet prior research offers only limited analysis of DL workloads. We therefore conduct a comprehensive investigation of DL job characteristics and resource management in a SenseTime datacenter containing over 6,000 GPUs. The study yields several insights from the perspectives of clusters, jobs and users, which motivate us to manage resources based on historical data so as to minimize job queuing delay and energy consumption. Beyond optimizing the scheduling of general training jobs, we observe that hyperparameter tuning jobs are pervasive and consume substantial resources within GPU clusters. We therefore build a holistic system that automatically applies novel hyperparameter transfer theory together with multiple system techniques to jointly improve tuning efficiency. At the job level, the system automatically scales models and accelerates tuning through inter-trial fusion, which combines multiple smaller models into a single unified entity. At the cluster level, it interacts with the scheduler to dynamically allocate resources and execute trials. In particular, it expands the available tuning resources by interleaving tuning trials with pipeline-parallel large-model training tasks, exploiting the idle intervals on each node, referred to as bubbles. Our experiments on the GPT-3 XL model demonstrate a significant acceleration of the hyperparameter tuning process, achieving a 78.5x makespan reduction.
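To make the bubble-interleaving idea concrete, here is a minimal Python sketch, not taken from the thesis: it illustrates the general notion of greedily packing tuning-trial steps into the idle intervals of a pipeline-parallel schedule. The greedy policy, the fixed step budget, and all names and durations are illustrative assumptions.

```python
# Hedged sketch (not the thesis system): pack short hyperparameter-tuning
# trial steps into the idle intervals ("bubbles") of a pipeline-parallel
# training schedule. All policies and numbers here are assumptions.

from dataclasses import dataclass

@dataclass
class Bubble:
    node: int       # node whose pipeline stage sits idle
    start: float    # idle interval start (seconds into the schedule)
    length: float   # idle interval duration

@dataclass
class Trial:
    trial_id: int
    est_step_time: float  # estimated time per tuning step on this node

def pack_trials_into_bubbles(bubbles, trials):
    """Greedily assign tuning-trial steps to pipeline bubbles.

    Each bubble absorbs as many steps of the current trial as fit in its
    idle window; leftover steps wait for the next bubble. Returns a list
    of (node, start, trial_id, n_steps) placements.
    """
    placements = []
    pending = list(trials)
    for b in sorted(bubbles, key=lambda b: (b.node, b.start)):
        if not pending:
            break
        t = pending[0]
        n_steps = int(b.length // t.est_step_time)
        if n_steps == 0:
            continue  # bubble too short for even one step of this trial
        placements.append((b.node, b.start, t.trial_id, n_steps))
        # Assume a fixed trial budget of 100 steps for illustration.
        t.remaining = getattr(t, "remaining", 100) - n_steps
        if t.remaining <= 0:
            pending.pop(0)
    return placements

if __name__ == "__main__":
    bubbles = [Bubble(node=0, start=1.0, length=0.8),
               Bubble(node=0, start=3.0, length=1.2),
               Bubble(node=1, start=2.5, length=0.5)]
    trials = [Trial(trial_id=7, est_step_time=0.05)]
    for p in pack_trials_into_bubbles(bubbles, trials):
        print("node %d @ %.1fs: trial %d runs %d steps" % p)
```

The point of the sketch is only that bubble placement reduces to an interval-packing problem once per-step trial times are estimated; the thesis system additionally coordinates with the cluster scheduler, which is not modeled here.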
In the second part of this thesis, we address the challenges that hinder the practical deployment of research prototypes, both for datacenter schedulers and for machine learning systems more broadly. Although recent DL-tailored schedulers have shown an impressive ability to improve job efficiency, deploying them in practice is hampered by substantial defects, including their inflexible, intrusive designs, exorbitant integration costs and limited scalability. To bridge these gaps, we design a non-intrusive and transparent scheduler that outperforms preemptive, intrusive schedulers. It uses a prediction-based packing strategy to circumvent interference and orchestrates resources according to estimated job priority values to achieve efficient scheduling. Additionally, in more general system-related research domains, machine learning techniques have been widely adopted to enhance system performance; however, we identify similar gaps in practical deployment, such as opaque decision processes, poor generalization and robustness, and exorbitant training and inference overhead. We develop a unified framework that resolves these challenges and facilitates transparent, accurate and lightweight systems built on interpretable models. The framework optimizes both the training and post-processing stages of constructing learning-augmented systems. In evaluations on cutting-edge systems in the storage and networking domains, it provides clear model interpretations, lower deployment costs and better system performance.
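The two ingredients named above, priority estimation and prediction-based packing, can be illustrated with a minimal Python sketch. This is not the thesis scheduler: the interference predictor, the shortest-remaining-work priority rule, and all job attributes are stand-in assumptions.

```python
# Hedged sketch (not the thesis scheduler): (1) derive a job priority from
# its predicted remaining runtime, and (2) pack a job next to a resident
# job only when a predictor says co-location will not interfere.
# The predictor and thresholds below are stand-in assumptions.

import heapq

def predicted_interference(job_a, job_b):
    # Stand-in interference model: co-locating two bandwidth-hungry jobs
    # is predicted to slow both down; a real system would learn this.
    return job_a["mem_bw"] + job_b["mem_bw"] > 1.0

def schedule(jobs, n_gpus):
    """One non-preemptive scheduling pass over pending jobs.

    Priority = predicted remaining runtime (smaller runs first). A job is
    packed onto a GPU with a resident job only if the interference
    predictor allows it; otherwise it takes a free GPU or waits.
    """
    heap = [(j["predicted_remaining"], j["name"], j) for j in jobs]
    heapq.heapify(heap)
    gpus = [[] for _ in range(n_gpus)]  # resident jobs per GPU
    waiting = []
    while heap:
        _, _, job = heapq.heappop(heap)
        # Prefer an empty GPU; otherwise try interference-safe packing.
        target = next((g for g in gpus if not g), None)
        if target is None:
            target = next((g for g in gpus
                           if len(g) == 1
                           and not predicted_interference(g[0], job)), None)
        (target.append(job) if target is not None else waiting.append(job))
    return gpus, waiting

if __name__ == "__main__":
    jobs = [{"name": "resnet", "predicted_remaining": 120, "mem_bw": 0.7},
            {"name": "bert",   "predicted_remaining":  45, "mem_bw": 0.6},
            {"name": "gpt",    "predicted_remaining": 300, "mem_bw": 0.3}]
    gpus, waiting = schedule(jobs, n_gpus=2)
    print("placements:", [[j["name"] for j in g] for g in gpus])
    print("waiting:", [j["name"] for j in waiting])
```

Because placement decisions are driven only by external predictions (remaining time, interference), this style of scheduler needs no hooks inside the training framework, which is the sense in which the thesis scheduler is non-intrusive.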
DOI: | 10.32657/10356/172372 |
---|---|
Citation: | Hu, Q. (2023). Building efficient and practical machine learning systems. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172372 |
Degree: | Doctor of Philosophy |
School: | School of Computer Science and Engineering |
Supervisors: | Wen Yonggang (YGWEN@ntu.edu.sg); Zhang Tianwei (tianwei.zhang@ntu.edu.sg) |
Funding: | RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) |
License: | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). |