Timed dataflow: Reducing communication overhead for distributed machine learning systems

Many distributed machine learning (ML) systems exhibit high communication overhead when dealing with big data sets. Our investigations showed that popular distributed ML systems could spend about an order of magnitude more time on network communication than on computation to train ML models containing millions of parameters. Such high communication overhead is mainly caused by two operations: pulling parameters and pushing gradients. In this paper, we propose an approach called Timed Dataflow (TDF) to deal with this problem by reducing network traffic using three techniques: a timed parameter storage system, a hybrid parameter filter and a hybrid gradient filter. In particular, the timed parameter storage technique and the hybrid parameter filter enable servers to discard unchanged parameters during the pull operation, and the hybrid gradient filter allows servers to drop gradients selectively during the push operation. Therefore, TDF could reduce network traffic and communication time significantly. Extensive performance evaluations in a real testbed showed that TDF could reduce up to 77% and 79% of the network traffic for the pull and push operations, respectively. As a result, TDF could speed up model training by a factor of up to 4 without sacrificing much accuracy for some popular ML models, compared to systems not using TDF.
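
The record above only summarizes the technique. As a rough sketch of the idea (not the paper's implementation: the class names, the logical clock standing in for the paper's timing mechanism, and the fixed magnitude threshold standing in for the hybrid filters are all illustrative assumptions), a pull can skip parameters unchanged since the worker's last fetch, and a push can drop near-zero gradients:

```python
# Illustrative sketch only; names, the logical clock, and the threshold
# are assumptions, not the TDF paper's actual design.

class TimedParameterStore:
    """Server-side store that version-stamps each parameter update so a
    pull can skip parameters unchanged since the worker's last pull."""

    def __init__(self):
        self.clock = 0        # logical clock, bumped on every update
        self.params = {}      # key -> value
        self.updated_at = {}  # key -> clock value at last change

    def update(self, key, value):
        self.clock += 1
        self.params[key] = value
        self.updated_at[key] = self.clock

    def pull(self, keys, since):
        # Parameter filter: return only entries changed after `since`.
        return {k: self.params[k] for k in keys
                if self.updated_at.get(k, 0) > since}


def filter_gradients(grads, threshold=1e-3):
    """Gradient filter: drop near-zero gradients before pushing them,
    trading a little accuracy for less network traffic."""
    return {k: g for k, g in grads.items() if abs(g) >= threshold}


# Toy usage: one worker round trip.
store = TimedParameterStore()
store.update("w1", 0.5)
store.update("w2", -1.2)

last_seen = store.clock        # worker is now up to date
store.update("w1", 0.55)       # only w1 changes afterwards

fresh = store.pull(["w1", "w2"], since=last_seen)
print(fresh)                   # {'w1': 0.55} -- unchanged w2 is skipped

push = filter_gradients({"w1": 0.02, "w2": 1e-7})
print(push)                    # {'w1': 0.02} -- tiny gradient dropped
```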

Bibliographic Details
Main Authors: SUN, Peng; WEN, Yonggang; TA, Nguyen Binh Duong; YAN, Shengen
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2016
Subjects: Computer and Systems Architecture; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/4834
DOI: 10.1109/ICPADS.2016.0146
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems