Adaptive neural networks for edge intelligence
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/172595
Institution: Nanyang Technological University
Summary:

Deep neural networks (DNNs) have achieved remarkable results and have become the mainstay of many applications, including autonomous driving and emerging AI-enabled chatbots. However, the superior performance of advanced DNN models comes at the cost of an enormous computation and memory footprint. For instance, ChatGPT, enabled by the GPT-3.5 model, breaks the records of multiple benchmarks with 175 billion parameters, which puts a significant strain on hardware capabilities. To satisfy the resource demands of DNNs and execute them efficiently on input data, the traditional AI paradigm hosts DNN models on powerful cloud servers: users' data are uploaded to the cloud for processing, and the results are returned to the users. Inevitably, this mode heightens users' concerns about data privacy. To address this concern, a new paradigm has emerged that deploys DNNs on edge devices near users to process private data securely. However, edge devices are usually sensitive to resource consumption, and different edge devices differ substantially in available resources and computing capability. Therefore, adapting DNNs to efficiently utilize the resources of diverse edge hardware becomes an urgent problem for edge intelligence. To address this problem, we focus on the design and efficient adaptation (e.g., compression and scaling) of DNN models, so that a model's execution overhead can be matched to the given hardware resources, achieving the best trade-off between execution efficiency and prediction accuracy.

First, to comprehensively understand the hardware, we introduce EDLAB, an end-to-end benchmark for evaluating edge deep learning accelerators. EDLAB consists of state-of-the-art deep learning models, a unified workload preprocessing and deployment framework, and a collection of comprehensive metrics. In EDLAB, we also propose parameterized models of the hardware performance bound, so that EDLAB can identify the potential of the hardware and the hardware utilization of different deep learning applications.

After evaluating the hardware, we adapt DNNs accordingly to efficiently utilize the available resources for better performance. For powerful edge devices with adequate resources, we propose HACScale and AdaptScale to efficiently scale DNN models for better accuracy without sacrificing execution efficiency. HACScale is a hardware-aware scaling framework that jointly scales the different dimensions of a model according to their impact on resource utilization and accuracy. When applying HACScale to different models, we observe that the optimal scaling strategy differs markedly across models, and sharing one scaling strategy across models may not achieve the best accuracy and resource utilization. Therefore, we further propose AdaptScale, a model-aware adaptive scaling framework that efficiently customizes the scaling strategy for each model to obtain the best performance.
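As a concrete illustration of the joint depth/width/resolution scaling idea behind these frameworks, the sketch below searches a small grid of scaling multipliers under a FLOPs budget. This is a minimal sketch, not the thesis implementation: the FLOPs model, the per-dimension benefit weights, and the search grid are assumptions made for the example, whereas HACScale and AdaptScale use hardware-aware measurements and accuracy impact.

```python
"""Minimal sketch of compound model scaling under a resource budget (illustrative only)."""
from itertools import product
import math

BASE_FLOPS = 400e6  # hypothetical cost of the unscaled baseline model
BENEFIT = {"depth": 1.0, "width": 1.1, "resolution": 0.9}  # assumed benefit weights

def scaled_flops(d: float, w: float, r: float) -> float:
    """FLOPs grow roughly linearly with depth and quadratically with width and resolution."""
    return BASE_FLOPS * d * (w ** 2) * (r ** 2)

def proxy_score(d: float, w: float, r: float) -> float:
    """Stand-in for predicted accuracy gain: larger multipliers help, with diminishing returns."""
    return (BENEFIT["depth"] * math.log(d)
            + BENEFIT["width"] * math.log(w)
            + BENEFIT["resolution"] * math.log(r))

def search(budget_flops: float, grid=(1.0, 1.1, 1.2, 1.3, 1.4, 1.5)):
    """Exhaustively try multiplier combinations and keep the best one that fits the budget."""
    best, best_score = None, float("-inf")
    for d, w, r in product(grid, repeat=3):
        if scaled_flops(d, w, r) <= budget_flops:
            score = proxy_score(d, w, r)
            if score > best_score:
                best, best_score = (d, w, r), score
    return best

if __name__ == "__main__":
    # Example: allow roughly 2x the baseline cost and print the chosen (depth, width, resolution) multipliers.
    print(search(budget_flops=2 * BASE_FLOPS))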
For less capable edge devices, we introduce two novel model compression frameworks, TECO and TICO, which reduce the cost of DNN models so that they can be efficiently deployed on resource-constrained hardware. TECO is a multi-dimensional model pruning framework. Compared to existing pruning frameworks that prune only a single dimension of DNN models, TECO collaboratively prunes multiple dimensions (i.e., depth, width, and resolution) to more comprehensively reduce redundant parameters and computation for higher execution efficiency. In TECO, we first introduce a two-stage importance evaluation framework, which efficiently and comprehensively evaluates each pruning unit according to both its local importance within each dimension and its global importance across dimensions. Based on this evaluation framework, we present a heuristic pruning algorithm that progressively prunes the three dimensions of CNNs towards the optimal trade-off between accuracy and efficiency.

In addition, we find that existing compression approaches mainly focus on reducing the inference overhead of models while ignoring the training overhead, which forfeits the opportunity to update the deployed model with private data on edge devices because of the huge training cost. To address this issue, we propose TICO, a co-optimization framework that optimizes both the training and inference performance of deep learning models. In TICO, we first introduce a novel multi-objective pruning approach: we take both training and inference performance as optimization objectives and formulate the pruning of a model as a multi-objective optimization problem. We then design an evolutionary algorithm to efficiently search for the optimal pruning decision. Moreover, to further reduce the training cost, we propose a resolution-adaptive training strategy, which trains models on small images in early epochs and progressively increases the training image size (a minimal sketch of such a schedule follows below). Compared to the traditional training paradigm, which trains a model at the same large image size throughout the whole training process, our approach significantly reduces the training cost and improves the training performance of models on edge devices.

In addition to optimizing the efficiency of DNN models at design time, we also propose EdgeCompress, a dynamic inference framework that avoids unnecessary computation at inference time. In EdgeCompress, we first introduce dynamic image cropping, where a lightweight foreground predictor accurately crops the most informative foreground object of each input image for inference, avoiding redundant computation on background regions. Subsequently, we present compound shrinking to collaboratively compress the three dimensions (depth, width, and resolution) of CNNs. Dynamic image cropping and compound shrinking together constitute a multi-dimensional CNN compression framework that comprehensively reduces the computational redundancy in both input images and network architectures, thereby improving the inference efficiency of CNNs. Further, we present a dynamic inference framework to efficiently process input images of different recognition difficulty: we cascade multiple models of different complexity produced by our compression framework and dynamically select a model for each input image. This further reduces computational redundancy and improves the inference efficiency of CNNs, facilitating the deployment of advanced CNNs on embedded hardware.
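The resolution-adaptive training strategy mentioned for TICO can be illustrated with a small schedule function that maps the current epoch to a training image size. This is a minimal sketch under assumed values: the start and end sizes, the epoch count, the linear ramp, and the rounding step are illustrative choices, not the schedule used in the thesis.

```python
"""Minimal sketch of a resolution-adaptive training schedule (illustrative assumptions only)."""

def image_size_for_epoch(epoch: int, total_epochs: int,
                         start_size: int = 128, end_size: int = 224,
                         step: int = 32) -> int:
    """Linearly ramp the training resolution from start_size to end_size,
    rounded down to a multiple of `step` so feature-map sizes stay regular."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    size = start_size + frac * (end_size - start_size)
    return max(start_size, int(size // step) * step)

# Example: a 90-epoch run starts on small images and finishes at full resolution.
for e in (0, 30, 60, 89):
    print(e, image_size_for_epoch(e, total_epochs=90))
```

In a real training loop, the returned size would drive the resize/crop transform of the data pipeline for that epoch, so early epochs are cheap and only the final epochs pay the full-resolution cost.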
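The dynamic inference idea described above cascades models of increasing complexity and stops as soon as a prediction is confident enough, so easy inputs never pay for the largest model. The sketch below illustrates that idea with a confidence threshold; the two-stage setup, the threshold value, and the stand-in models are assumptions for illustration and may differ from the EdgeCompress implementation.

```python
"""Minimal sketch of confidence-thresholded cascaded inference (illustrative only)."""
from typing import Callable, Sequence, Tuple

Prediction = Tuple[int, float]  # (class index, confidence)

def cascade_predict(x,
                    models: Sequence[Callable[[object], Prediction]],
                    threshold: float = 0.9) -> Prediction:
    """Run progressively larger models; return as soon as one is confident enough."""
    pred: Prediction = (0, 0.0)
    for model in models:
        pred = model(x)
        if pred[1] >= threshold:
            break  # easy input: a cheap model suffices
    return pred  # hard input: falls through to the largest model

# Toy usage with stand-in "models"; real ones would be compressed CNNs of increasing capacity.
small = lambda x: (1, 0.95 if x == "easy" else 0.40)
large = lambda x: (1, 0.99)

print(cascade_predict("easy", [small, large]))  # answered by the small model
print(cascade_predict("hard", [small, large]))  # escalated to the large model
```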