JALAD : joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution
Recent years have witnessed a rapid growth of deep-network based services and applications. A practical and critical problem thus has emerged: how to effectively deploy the deep neural network models such that they can be executed efficiently. Conventional cloud-based approaches usually run the deep...
Saved in:
Main Authors: | , , , , , |
---|---|
Other Authors: | |
Format: | Conference or Workshop Item |
Language: | English |
Published: |
2020
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/143195 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | Recent years have witnessed a rapid growth of deep-network based services and applications. A practical and critical problem thus has emerged: how to effectively deploy the deep neural network models such that they can be executed efficiently. Conventional cloud-based approaches usually run the deep models in data center servers, causing large latency because a significant amount of data has to be transferred from the edge of network to the data center. In this paper, we propose JALAD, a joint accuracy- and latency-aware execution framework, which decouples a deep neural network so that a part of it will run at edge devices and the other part inside the conventional cloud, while only a minimum amount of data has to be transferred between them. Though the idea seems straightforward, we are facing challenges including i)how to find the best partition of a deep structure; ii)how to deploy the component at an edge device that only has limited computation power; and iii)how to minimize the overall execution latency. Our answers to these questions are a set of strategies in JALAD, including 1)A normalization based in-layer data compression strategy by jointly considering compression rate and model accuracy; 2)A latency-aware deep decoupling strategy to minimize the overall execution latency; and 3)An edge-cloud structure adaptation strategy that dynamically changes the decoupling for different network conditions. Experiments demonstrate that our solution can significantly reduce the execution latency: it speeds up the overall inference execution with a guaranteed model accuracy loss. |
---|