JALAD: Joint Accuracy- and Latency-Aware Deep Structure Decoupling for Edge-Cloud Execution

Bibliographic Details
Main Authors: Li, Hongshan, Hu, Chenghao, Jiang, Jingyan, Wang, Zhi, Wen, Yonggang, Zhu, Wenwu
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2020
Online Access:https://hdl.handle.net/10356/143195
Institution: Nanyang Technological University
Description
Summary: Recent years have witnessed rapid growth in deep-network-based services and applications. A practical and critical problem has thus emerged: how to deploy deep neural network models so that they can be executed efficiently. Conventional cloud-based approaches usually run the deep models in data-center servers, incurring large latency because a significant amount of data must be transferred from the network edge to the data center. In this paper, we propose JALAD, a joint accuracy- and latency-aware execution framework that decouples a deep neural network so that one part runs on edge devices and the other part in the conventional cloud, while only a minimal amount of data is transferred between them. Though the idea seems straightforward, we face several challenges: i) how to find the best partition of a deep structure; ii) how to deploy the edge-side component on a device with only limited computation power; and iii) how to minimize the overall execution latency. Our answers to these questions are a set of strategies in JALAD, including 1) a normalization-based in-layer data compression strategy that jointly considers compression rate and model accuracy; 2) a latency-aware deep decoupling strategy that minimizes the overall execution latency; and 3) an edge-cloud structure adaptation strategy that dynamically changes the decoupling under different network conditions. Experiments demonstrate that our solution significantly reduces execution latency: it speeds up overall inference execution while keeping model accuracy loss within a guaranteed bound.
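The core decision the abstract describes — choosing where to split a network between edge and cloud so that total latency is minimized — can be illustrated with a minimal sketch. This is not the paper's algorithm; it assumes hypothetical per-layer latency profiles (`edge_ms`, `cloud_ms`), boundary tensor sizes (`sizes_kb`), and a fixed network bandwidth, and simply enumerates all split points:

```python
# Minimal sketch of latency-aware DNN partitioning for edge-cloud
# execution. All profiles and numbers are illustrative, not from JALAD.

def best_partition(edge_ms, cloud_ms, sizes_kb, bandwidth_kbps):
    """Pick the split point with the lowest end-to-end latency.

    edge_ms[i], cloud_ms[i]: latency of layer i on the edge / in the cloud.
    sizes_kb[k]: data crossing the boundary when the first k layers run on
    the edge (sizes_kb[0] is the raw input, sent when everything runs in
    the cloud). Returns (split, total_latency_ms).
    """
    n = len(edge_ms)
    best = None
    for split in range(n + 1):
        edge_part = sum(edge_ms[:split])              # layers on the edge
        cloud_part = sum(cloud_ms[split:])            # layers in the cloud
        transfer = sizes_kb[split] / bandwidth_kbps * 1000.0  # ms on the wire
        total = edge_part + transfer + cloud_part
        if best is None or total < best[1]:
            best = (split, total)
    return best

# Example: activations shrink with depth, so a mid-network split can beat
# both the all-cloud (split=0) and all-edge (split=4) extremes.
split, latency = best_partition(
    edge_ms=[5, 8, 20, 60],
    cloud_ms=[1, 1, 2, 3],
    sizes_kb=[600, 300, 50, 40, 4],
    bandwidth_kbps=1000,
)
```

In this toy profile the search selects a split after the second layer, since later activations are small enough that shipping them to the cloud costs less than running the heavy later layers on the weak edge device; the paper's adaptation strategy would re-run such a decision as bandwidth changes.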