Stitching weight-shared deep neural networks for efficient multitask inference on GPU

Intelligent personal and home applications demand multiple deep neural networks (DNNs) running on resource-constrained platforms for compound inference tasks, known as multitask inference. To fit multiple DNNs into low-resource devices, emerging techniques resort to weight sharing among DNNs to reduce their storage. However, such reduction in storage fails to translate into efficient execution on common accelerators such as GPUs. Most DNN graph rewriters are blind to multi-DNN optimization, while GPU vendors provide inefficient APIs for parallel multi-DNN execution at runtime. A few prior graph rewriters suggest cross-model graph fusion for low-latency multi-DNN execution, yet they require duplication of the shared weights, erasing the memory saving of weight-shared DNNs. In this paper, we propose MTS, a novel graph rewriter for efficient multitask inference with weight-shared DNNs. MTS adopts a model stitching algorithm that outputs a single computational graph for weight-shared DNNs without duplicating any shared weight. MTS also utilizes a model grouping strategy to avoid overwhelming the GPU when co-running tens of DNNs. Extensive experiments show that MTS accelerates multitask inference by up to 6.0x compared to sequentially executing multiple weight-shared DNNs. MTS also yields up to 2.5x lower latency and 3.7x less memory usage compared with NETFUSE, a state-of-the-art multi-DNN graph rewriter.
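The abstract's core idea, sharing weights across task models so they are stored once rather than per model, can be illustrated with a minimal PyTorch sketch. This is not the paper's graph rewriter (MTS operates on computational graphs of weight-shared DNNs at the GPU level); the module names (Backbone, TaskHead) and layer sizes below are illustrative assumptions, showing only how one shared set of parameters can serve several task-specific branches in a single forward pass.

    import torch
    import torch.nn as nn

    # Conceptual sketch only: two task heads reuse one backbone, so the
    # shared parameters exist once in memory and the shared subgraph runs
    # once per input, analogous to a "stitched" multitask graph. This is
    # an assumption-laden illustration, not the MTS algorithm itself.

    class Backbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        def forward(self, x):
            return self.features(x)

    class TaskHead(nn.Module):
        def __init__(self, num_outputs):
            super().__init__()
            self.fc = nn.Linear(32, num_outputs)

        def forward(self, z):
            return self.fc(z)

    backbone = Backbone()                  # shared weights, stored once
    heads = [TaskHead(10), TaskHead(5)]    # small per-task weights

    x = torch.randn(1, 3, 64, 64)
    z = backbone(x)                        # shared subgraph evaluated once
    outputs = [head(z) for head in heads]  # task-specific branches

Executing tens of such models naively (e.g., sequentially or via separate GPU streams) is what the paper reports as inefficient; MTS instead rewrites the weight-shared models into one computational graph, and groups models so that co-running them does not overwhelm the GPU.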


Bibliographic Details
Main Authors: WANG, Zeyu; HE, Xiaoxi; ZHOU, Zimu; WANG, Xu; MA, Qiang; MIAO, Xin; LIU, Zhuo; THIELE, Lothar; YANG, Zheng
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Deep Neural Networks; Multitask Inference; Model Acceleration; OS and Networks; Software Engineering
DOI: 10.1109/SECON55815.2022.9918563
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Online Access:https://ink.library.smu.edu.sg/sis_research/7486
https://ink.library.smu.edu.sg/context/sis_research/article/8489/viewcontent/secon22_wang.pdf
Institution: Singapore Management University
Collection: Research Collection School of Computing and Information Systems (InK@SMU)