Hardware-aware neural architecture search and compression towards embedded intelligence

Bibliographic Details
Main Author: Luo, Xiangzhong
Other Authors: Weichen Liu
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/172506
Institution: Nanyang Technological University
Description
Summary: With the increasing availability of large-scale datasets and powerful computing paradigms, convolutional neural networks (CNNs) have empowered a wide range of intelligent embedded vision tasks, which span from image classification to downstream vision tasks such as on-device object recognition, detection, and tracking. In the past few years, convolutional networks have grown ever deeper and wider in order to maintain superior accuracy on the target task. This rule of thumb, despite its efficacy, leads to an exponential growth in the number of floating-point operations (FLOPs) and parameters. For example, ResNet50, one of the most representative convolutional networks, involves over 4 billion FLOPs and 25 million parameters. The prohibitive network complexity further widens the computational gap between computation-intensive CNNs and resource-constrained embedded platforms, making it challenging to develop hardware-friendly network solutions that fit the limited computational resources available in real-world embedded scenarios and thus to realize embedded intelligence. This thesis focuses on alleviating this computational gap from the perspective of hardware-aware neural architecture search (NAS) and compression.

First, we introduce SurgeNAS for efficient architecture search. Specifically, SurgeNAS returns to one-level optimization for accurate and consistent gradient estimation, and features an effective identity mapping scheme to avoid search collapse. In addition, we introduce an efficient ordered differentiable sampling approach that reduces memory consumption to the single-path level while maintaining strict search fairness. An efficient latency predictor based on graph neural networks (GNNs) is further proposed and integrated into the search engine to avoid tedious on-device latency measurements during the search process. Finally, we introduce the paradigm of Comfort Zone, which allows us to scale up the searched architecture candidates for better accuracy on the target task without degrading the inference efficiency on the target hardware.

Furthermore, we introduce LightNAS for flexible architecture search. The motivation behind LightNAS is that previous NAS methods, including SurgeNAS, focus only on reducing the explicit search cost (the time for one single search) while ignoring the substantial implicit search cost (the time spent on manual hyper-parameter tuning to derive the required architecture candidate). In practice, previous NAS methods have to perform manual hyper-parameter tuning to find an architecture candidate that satisfies the specified latency constraint, which empirically involves around 10 trial-and-error runs and thus increases the total search cost by roughly 10 times. In contrast, LightNAS requires only one single search for any specified latency constraint (i.e., you only search once), as illustrated by the sketch following this paragraph. In addition, we introduce an efficient yet reliable proxy, namely batchwise training estimation (BTE), which can be seamlessly integrated into LightNAS to enable channel-level exploration at low computational cost. This further boosts the attainable accuracy on the target task without degrading the efficiency on the target hardware.
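To make the latency-constrained search objective more concrete, the following minimal sketch shows how a differentiable latency penalty can be combined with the task loss in a single search. The names used here (search_step, predict_latency, arch_params, target_latency, lambda_lat) are illustrative assumptions, not the exact LightNAS formulation from the thesis.

    # Minimal sketch of a latency-constrained differentiable search step.
    # All names below are hypothetical; this is not the thesis's actual code.
    import torch
    import torch.nn.functional as F

    def search_step(model, arch_params, batch, optimizer,
                    predict_latency, target_latency, lambda_lat=1.0):
        inputs, labels = batch
        task_loss = F.cross_entropy(model(inputs), labels)

        # Differentiable latency estimate of the current architecture encoding,
        # e.g. produced by a (GNN-based) latency predictor.
        latency = predict_latency(arch_params)

        # Penalize architectures that exceed the user-specified latency budget,
        # so one search can directly target any given constraint.
        latency_penalty = torch.relu(latency / target_latency - 1.0)

        loss = task_loss + lambda_lat * latency_penalty
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), latency.item()

In this sketch the optimizer updates the network weights and the architecture parameters against the same loss, loosely mirroring the one-level optimization adopted in SurgeNAS.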
Finally, we introduce Domino for efficient network compression, in which we are the first to revisit the trade-off dilemma between accuracy and efficiency from the fresh perspective of linearity and non-linearity. Specifically, Domino focuses on trading the less important network non-linearity for better network efficiency. To this end, Domino leverages two efficient performance predictors, one vanilla latency predictor and one meta-accuracy predictor, to identify the less important non-linear building blocks, which are then grafted with their linear counterparts. The resulting grafted network is further trained on the target task to achieve decent accuracy. Finally, we reparameterize each grafted linear building block, which consists of multiple consecutive linear layers (convolutional, batch normalization (BN), and grafted linear activation layers), into one single convolutional layer, which aggressively boosts the efficiency on the target hardware and, more importantly, does not sacrifice the accuracy on the target task, since the network produces the same outputs before and after the linear reparameterization (a minimal sketch of this folding step is given at the end of this summary).

In summary, this thesis focuses on hardware-aware neural architecture search and compression to deliver efficient network solutions for resource-constrained embedded platforms and thereby empower embedded intelligence. Future research will continue to explore more general search spaces and more advanced search and compression techniques to develop more efficient networks for intelligent embedded applications.
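As a small illustration of the linear reparameterization step described above, the sketch below folds a convolutional layer and its following batch normalization layer into one equivalent convolution. The helper name fuse_conv_bn and the use of PyTorch modules are assumptions for illustration rather than the actual Domino implementation; a grafted linear activation (e.g., y = a*x + b) can be absorbed in the same channel-wise fashion.

    # Illustrative sketch: fold Conv2d + BatchNorm2d into a single Conv2d.
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
        fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                          conv.stride, conv.padding, conv.dilation, conv.groups,
                          bias=True)
        # Per-channel scale from BN statistics: gamma / sqrt(var + eps).
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)
        return fused

    # Sanity check (eval mode): the fused layer reproduces conv followed by BN.
    conv, bn = nn.Conv2d(8, 16, 3, padding=1).eval(), nn.BatchNorm2d(16).eval()
    x = torch.randn(1, 8, 32, 32)
    assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)

Because a composition of linear layers is itself linear, the fused convolution yields the same outputs as the original sequence, which is why such reparameterization improves efficiency without affecting accuracy.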