Timed dataflow: Reducing communication overhead for distributed machine learning systems

Many distributed machine learning (ML) systems exhibit high communication overhead when dealing with big data sets. Our investigations showed that popular distributed ML systems could spend about an order of magnitude more time on network communication than on computation to train ML models containing millions of parameters. Such high communication overhead is mainly caused by two operations: pulling parameters and pushing gradients. In this paper, we propose an approach called Timed Dataflow (TDF) to deal with this problem by reducing network traffic using three techniques: a timed parameter storage system, a hybrid parameter filter and a hybrid gradient filter. In particular, the timed parameter storage technique and the hybrid parameter filter enable servers to discard unchanged parameters during the pull operation, and the hybrid gradient filter allows servers to drop gradients selectively during the push operation. Therefore, TDF could reduce network traffic and communication time significantly. Extensive performance evaluations in a real testbed showed that TDF could reduce up to 77% and 79% of the network traffic for the pull and push operations, respectively. As a result, TDF could speed up model training by a factor of up to 4 without sacrificing much accuracy for some popular ML models, compared to systems not using TDF.
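
The record above only summarizes the technique. As a rough sketch of the idea (not the paper's implementation: the class names, the logical clock standing in for the paper's timing mechanism, and the fixed magnitude threshold standing in for the hybrid filters are all illustrative assumptions), a pull can skip parameters unchanged since the worker's last fetch, and a push can drop near-zero gradients:

```python
# Illustrative sketch only; names, the logical clock, and the threshold
# are assumptions, not the TDF paper's actual design.

class TimedParameterStore:
    """Server-side store that version-stamps each parameter update so a
    pull can skip parameters unchanged since the worker's last pull."""

    def __init__(self):
        self.clock = 0        # logical clock, bumped on every update
        self.params = {}      # key -> value
        self.updated_at = {}  # key -> clock value at last change

    def update(self, key, value):
        self.clock += 1
        self.params[key] = value
        self.updated_at[key] = self.clock

    def pull(self, keys, since):
        # Parameter filter: return only entries changed after `since`.
        return {k: self.params[k] for k in keys
                if self.updated_at.get(k, 0) > since}


def filter_gradients(grads, threshold=1e-3):
    """Gradient filter: drop near-zero gradients before pushing them,
    trading a little accuracy for less network traffic."""
    return {k: g for k, g in grads.items() if abs(g) >= threshold}


# Toy usage: one worker round trip.
store = TimedParameterStore()
store.update("w1", 0.5)
store.update("w2", -1.2)

last_seen = store.clock        # worker is now up to date
store.update("w1", 0.55)       # only w1 changes afterwards

fresh = store.pull(["w1", "w2"], since=last_seen)
print(fresh)                   # {'w1': 0.55} -- unchanged w2 is skipped

push = filter_gradients({"w1": 0.02, "w2": 1e-7})
print(push)                    # {'w1': 0.02} -- tiny gradient dropped
```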

Bibliographic Details
Main Authors: SUN, Peng; WEN, Yonggang; TA, Nguyen Binh Duong; YAN, Shengen
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2016
Subjects: Computer and Systems Architecture; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/4834
DOI: 10.1109/ICPADS.2016.0146
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems