Deep learning acceleration: from quantization to in-memory computing
| Field | Value |
|---|---|
| Format | Thesis-Doctor of Philosophy |
| Language | English |
| Published | Nanyang Technological University, 2022 |
| Online Access | https://hdl.handle.net/10356/163448 |
| Institution | Nanyang Technological University |
Summary:

Deep learning has demonstrated high accuracy and efficiency in various applications. For example, Convolutional Neural Networks (CNNs), widely adopted in Computer Vision (CV), and Transformers, broadly applied in Natural Language Processing (NLP), are representative deep learning models. Deep learning models have grown deeper and larger in the past few years to obtain higher accuracy. Meanwhile, these larger models bring challenges to inference on the edge. These computation-intensive and memory-intensive models are not only bounded by limited computational resources but also suffer from the long latency and high energy cost of heavy memory access. Therefore, accelerating deep learning inference on the edge requires software/hardware co-optimization.
From the software perspective, thanks to the fault-tolerant nature of deep learning models, quantizing 32-bit values to low-bitwidth values effectively reduces the model size and the computational complexity. Ternary and binary neural networks are representative quantized networks that achieve 16-32X model size reduction and up to 64X theoretical speedup. However, their low-bitwidth storage schemes and arithmetic operations are inefficient on Central Processing Unit (CPU) and Graphics Processing Unit (GPU) platforms due to inefficient encoding and dot-product implementations. Existing ternary and binary encoding schemes are complex and incompatible with each other. In addition, current ternary and binary dot products contain redundant operations, and mixed-precision ternary and binary dot products are missing.
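As a concrete illustration of why binarization makes dot products cheap, the minimal Python sketch below packs {-1, +1} vectors into bitmasks and evaluates the dot product with XNOR and popcount. This is the standard XNOR-popcount formulation, not the encoding or kernels proposed in this thesis; the function names and values are illustrative.

```python
# Minimal sketch (not this thesis's scheme): binary quantization turns a
# dot product into XNOR + popcount on packed bits.

def binarize(x):
    """Map real values to {-1, +1} by sign (0 treated as +1)."""
    return [1 if v >= 0 else -1 for v in x]

def pack_bits(signs):
    """Pack a {-1, +1} vector into an integer bitmask: +1 -> 1, -1 -> 0."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors over n positions:
    matches = popcount(XNOR(a, b)); dot = 2 * matches - n."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n

a = [0.7, -1.2, 0.1, -0.4]
b = [-0.3, -0.9, 0.5, 0.8]
sa, sb = binarize(a), binarize(b)
print(binary_dot(pack_bits(sa), pack_bits(sb), len(a)))  # 0
print(sum(x * y for x, y in zip(sa, sb)))                # reference: 0
```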
Among various deep learning models, the Ternary Weight Network (TWN) and the Adder Neural Network (AdderNet) are two other promising neural network families with higher accuracy than ternary and binary neural networks. Moreover, compared with integer-quantized and full-precision models, TWN and AdderNet have a unique advantage: they replace multiplication operations with lightweight addition and subtraction operations, which are favoured by In-Memory Computing (IMC) architectures.
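The following toy Python comparison is only meant to make the addition-centric property concrete: a standard convolution response needs multiply-accumulate, a TWN response with weights in {-1, 0, +1} reduces to signed accumulation, and an AdderNet response is a negative L1 distance built from subtractions and additions. The function names and values are illustrative, not from the thesis.

```python
# Toy comparison of multiplication-based vs. addition-centric responses.

def conv_response(x, w):
    """Standard convolution kernel response: multiply-accumulate."""
    return sum(xi * wi for xi, wi in zip(x, w))

def twn_response(x, w_ternary):
    """TWN: weights come from {-1, 0, +1}, so the response reduces to
    signed accumulation of activations; no multiplications are needed."""
    acc = 0.0
    for xi, wi in zip(x, w_ternary):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi
        # wi == 0: nothing to do (this sparsity is what FAT later exploits)
    return acc

def addernet_response(x, w):
    """AdderNet: similarity is the negative L1 distance, i.e. only
    subtractions, absolute values, and additions."""
    return -sum(abs(xi - wi) for xi, wi in zip(x, w))

x = [0.5, -1.0, 2.0, 0.25]
print(conv_response(x, [0.25, -0.5, 0.5, 1.0]))      # 1.875
print(twn_response(x, [1, 0, -1, 1]))                # -1.25
print(addernet_response(x, [0.5, -0.75, 1.5, 0.0]))  # -1.0
```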
From the hardware perspective, IMC architectures compute inside Non-Volatile Memory (NVM) arrays to reduce the data-movement overhead. IMC architectures perform addition and Boolean operations in parallel, which makes them well suited to accelerating addition-centric deep learning models such as TWN and AdderNet. However, the addition and subtraction operators and the data mapping schemes for deep learning models on existing IMC designs are not fully optimized.
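To make the in-memory computing model concrete, here is a toy software sketch (my own simplification, not any particular NVM design from the thesis) in which a row-parallel Boolean primitive is available in each "array cycle" and multi-bit addition is built bit-serially on top of it; the repeated carry-handling cycles hint at the carry-propagation overhead that FAT addresses later.

```python
# Toy software model of in-memory bitwise computing: activating two rows lets
# every column produce a Boolean result in parallel, and multi-bit addition is
# built bit-serially from those row-parallel primitives.

WORD = 8  # bits per stored value in this toy model

def row_and(a, b):
    """All columns compute AND in one 'array cycle'."""
    return [x & y for x, y in zip(a, b)]

def row_xor(a, b):
    """All columns compute XOR in one 'array cycle'."""
    return [x ^ y for x, y in zip(a, b)]

def in_array_add(col_a, col_b):
    """Element-wise add of two columns of unsigned integers using only the
    row-parallel Boolean primitives; the carry is resolved over extra cycles."""
    result = [0] * len(col_a)
    carry = [0] * len(col_a)
    for bit in range(WORD):
        a_bit = [(v >> bit) & 1 for v in col_a]
        b_bit = [(v >> bit) & 1 for v in col_b]
        g = row_and(a_bit, b_bit)               # carry generate
        p = row_xor(a_bit, b_bit)               # carry propagate
        s = row_xor(p, carry)                   # sum bit for all elements at once
        carry = [gi | (pi & ci) for gi, pi, ci in zip(g, p, carry)]
        for i, sbit in enumerate(s):
            result[i] |= sbit << bit
    return result

print(in_array_add([3, 10, 77], [5, 22, 100]))  # [8, 32, 177]
```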
In this thesis, we accelerate deep learning inference from both the software and the hardware perspectives. On the software side, we propose TAB to accelerate quantized ternary and binary deep learning models on the edge. First, we propose a unified value representation based on standard signed integer encoding. Second, we introduce a bitwidth-last data storage format to avoid the overhead of extracting the sign bit. Third, we propose ternary and binary bitwise dot products based on Gated-XOR, which use 25% to 61% fewer operations than State-Of-The-Art (SOTA) methods. Finally, we implement TAB on both CPU and GPU platforms as an open-source library with optimized bitwise kernels. Experimental results show that TAB's ternary and binary neural networks achieve up to 34.6X to 72.2X speedup over full-precision networks.
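As a rough idea of how a gated-XOR ternary dot product can work, the sketch below assumes a two-bitmask encoding (a "nonzero" mask and a "sign" mask per vector): the AND of the nonzero masks gates the XOR of the sign masks, and two popcounts give the result. This encoding and formulation are my illustrative assumptions, not necessarily TAB's exact representation or kernels.

```python
# Hedged sketch of a gated-XOR style ternary dot product. The (nonzero, sign)
# bitmask encoding and the bit manipulations below are illustrative assumptions.

def encode_ternary(values):
    """Pack a {-1, 0, +1} vector into (nonzero_bits, sign_bits) bitmasks."""
    nz, sign = 0, 0
    for i, v in enumerate(values):
        if v != 0:
            nz |= 1 << i
        if v == -1:
            sign |= 1 << i
    return nz, sign

def ternary_dot(a, b):
    """Dot product of two packed ternary vectors:
    gate = positions where both operands are nonzero (AND),
    diff = gated positions where the signs differ (XOR)."""
    gate = a[0] & b[0]
    diff = gate & (a[1] ^ b[1])
    return bin(gate).count("1") - 2 * bin(diff).count("1")

x = [1, -1, 0, 1, 0, -1]
w = [-1, -1, 1, 1, 0, 1]
print(ternary_dot(encode_ternary(x), encode_ternary(w)))  # 0
print(sum(xi * wi for xi, wi in zip(x, w)))               # reference: 0
```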
Next, on the hardware side, we propose FAT, an in-memory accelerator for TWNs, with three contributions: a fast addition scheme that avoids the time overhead of carry propagation and write-back, a sparse addition control unit that exploits sparsity to skip operations on zero weights, and a combined-stationary data mapping that reduces data movement and increases parallelism across memory columns. Compared with SOTA IMC accelerators, FAT achieves 10.02X speedup and 12.19X energy efficiency on networks with 80% average sparsity.
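The following Python sketch only illustrates the algorithmic intent behind FAT's two main ideas, under my own simplifications: a carry-save style accumulation that defers carry propagation to one final step, and a sparsity check that skips zero-weight positions entirely. FAT's actual in-memory addition scheme and control circuits are not reproduced here.

```python
# Illustrative sketch: deferred carry propagation plus zero-weight skipping.

MASK = (1 << 16) - 1  # 16-bit toy datapath (two's complement wrap-around)

def carry_save_accumulate(values):
    """Accumulate integers while keeping separate 'sum' and 'carry' words,
    so no long carry chain is resolved per addition."""
    s, c = 0, 0
    for v in values:
        v &= MASK
        new_s = s ^ v ^ c                          # bitwise sum, no ripple carry
        new_c = ((s & v) | (c & (s | v))) << 1     # deferred carries
        s, c = new_s & MASK, new_c & MASK
    return (s + c) & MASK                          # one final carry-propagating add

def sparse_ternary_accumulate(activations, ternary_weights):
    """Only add or subtract where the weight is nonzero; count skipped columns."""
    terms, skipped = [], 0
    for a, w in zip(activations, ternary_weights):
        if w == 0:
            skipped += 1                           # the control unit skips this column
        else:
            terms.append(a if w == 1 else (-a) & MASK)
    return carry_save_accumulate(terms), skipped

acts = [3, 7, 2, 9, 4]
wts  = [1, 0, -1, 1, 0]
print(sparse_ternary_accumulate(acts, wts))  # (10, 2)
```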
Last, we propose iMAD, another in-memory accelerator, for AdderNet. First, we co-optimize the in-memory subtraction and addition operators to reduce latency, energy, and sensing-circuit area. Second, we design a highly parallel accelerator architecture for AdderNet based on the optimized operators. Third, at the algorithm level, we propose an IMC-friendly computation pipeline for AdderNet convolution to further boost performance. Evaluation results show that iMAD achieves 3.25X speedup and 3.55X energy efficiency compared with a SOTA in-memory accelerator.
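For reference, the sketch below is a plain-Python AdderNet-style convolution: each output is the negative L1 distance between the filter and an input patch, computed with subtract, absolute-value, and accumulate steps only. The IMC-friendly pipeline proposed in the thesis is not reproduced here, and the example values are arbitrary.

```python
# Reference AdderNet-style convolution (valid padding, single channel).

def addernet_conv2d(feature, kernel):
    """feature: H x W list of lists; kernel: k x k list of lists."""
    H, W, k = len(feature), len(feature[0]), len(kernel)
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    diff = feature[i + di][j + dj] - kernel[di][dj]  # subtraction
                    acc += diff if diff >= 0 else -diff              # |.| then add
            row.append(-acc)  # closer patches give larger (less negative) responses
        out.append(row)
    return out

feat = [[1.5, 2.0, 0.0],
        [0.5, 1.5, 1.0],
        [2.0, 0.0, 1.0]]
kern = [[1.0, 2.0],
        [0.5, 1.5]]
print(addernet_conv2d(feat, kern))  # [[-0.5, -4.5], [-4.0, -2.5]]
```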
In summary, we accelerate deep learning models through software/hardware co-design. We propose a unified and optimized ternary and binary inference framework with unified encoding, optimized data storage, efficient bitwise dot products, and a programming library on existing CPU and GPU platforms. We further propose two hardware accelerators, for TWNs and AdderNet, with optimized operators, architectures, algorithms, and data mapping schemes on emerging in-memory computing platforms. In the future, we will extend the in-memory computing architectures to accelerate other types of deep learning models, such as Transformers. We will also explore general-purpose in-memory computing by integrating lightweight RISC-V CPU cores with computational memory arrays.