Hardware and algorithm co-optimization for energy-efficient machine learning integrated circuits
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/168401 |
Institution: | Nanyang Technological University |
Summary: | Computing faces a new challenge: the gains offered by technology scaling alone can no longer cover the shortfall in processing capability caused by the exponential growth of data generation. The traditional Von Neumann digital architecture struggles with highly data-intensive, massively parallel workloads such as deep neural network (DNN) and other machine learning applications. High-speed multi-stream processors address these computing challenges by supplying raw computing power. However, deploying embedded hardware in the edge environment remains challenging. More specifically, in the edge environment, where area and energy efficiency are heavily emphasized, the communication bottleneck inherent in the traditional (i.e., Von Neumann) architecture leads to high energy consumption. To address this issue, we apply combined optimization techniques in VLSI circuit architecture and compute algorithms, reducing the energy and area cost of data movement between the memory and the ALU.
Computing-in-memory (CIM) architecture uses conventional SRAM bitcell operation so that the array serves as both parameter storage and processing element. Despite its architectural advantages in efficiency, electrical issues associated with the memory bitcell array significantly limit its applicability in commercial system hardware. The issues that arise in the array implementation mirror those of traditional analog computing, such as variation-induced non-linearity, limited operable dynamic range on the shared bit-line, and A/D conversion overhead. This work addresses these issues by imposing a digital abstraction layer on the analog signals. In addition, by adopting the digital computing paradigm, our work offers further design flexibility through technology/voltage/frequency scalability and compute-precision reconfigurability.
We start the discussion by implementing a computing-in-memory (CIM) architecture to tackle the excessive energy consumption from data movement. Our first work is an SRAM-based CIM macro with pseudo-differential voltage-mode accumulators. The design uses a binary neural network (BNN) as the target DNN benchmark, and the macro maps 64×128 1b weights in its CIM bitcell array. Design features include a reconfigurable 1-5b row-by-row ADC and a binary-search-based calibration scheme that rejects residual non-linearity. Despite these features and design advantages, several concerns remained: the design could not fully address variation-induced error, suffered from ADC overhead, and could handle only low-precision parameters. It achieved a maximum energy efficiency of 87 TOPS/W and an area efficiency of 3.97 TOPS/mm² using 1b weights in a 128×128 array designed in 65nm CMOS.
The second work, Colonnade, addresses the issues in the first design. Colonnade is also an SRAM-based CIM macro, but it adopts the digital computing paradigm to avoid many analog-related problems. It implements a digital CIM macro that has no data conversion overhead and is robust to variation and noise, while supporting a wide range of reconfigurable parameter precisions and DNN model architectures with a scalable 128×128 digital bitcell array. The drawback of this design is low memory density, caused by the high computing-hardware redundancy that is unavoidable when each bitcell has two additional computing blocks fused in. The design achieved a maximum energy efficiency of 117.3 TOPS/W and an area efficiency of 6.75 TOPS/mm² using 1b weights in a 128×128 digital bitcell array designed in 65nm CMOS.
The third work, which implements near-memory (NM) computing, partly alleviates the density issue. The proposed design uses a custom-designed 7T SRAM bitcell to store each 1b weight, and sixteen such bitcells are grouped into a column MAC. A bit-serial compute block is placed in each column MAC to realize the NM architecture. As a result, the digital SRAM-based NM macro offers five times higher memory density than Colonnade, effectively resolving the hardware redundancy. The third design achieved energy efficiency ranging from 315 to 1.23 TOPS/W and area efficiency ranging from 4.3 to 0.270 TOPS/mm² using 1-16b weights in a 20×256 array built in 65nm CMOS.
We see great potential in co-optimizing the encoding scheme and the hardware design within the design flow. The Tensor-Train Decomposition algorithm and quantization techniques provide substantial parameter compression, which can resolve the memory density issue that was not fully mitigated in the previous works. In this ongoing research, we devised a test chip that runs DNN inference with orders of magnitude fewer stored parameters without severe degradation in performance. This work is currently under test, and only a few preliminary results are available. |
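
To give a rough sense of why Tensor-Train decomposition can cut the number of stored parameters by orders of magnitude, the sketch below counts parameters for a dense weight matrix and for a TT-matrix factorization of the same shape. The layer size, the factorization of its dimensions, and the TT ranks are hypothetical values chosen for illustration; none of them come from the thesis, and the function names are assumptions of this sketch.

```python
from math import prod


def dense_param_count(row_factors, col_factors):
    """Parameters in a dense matrix of shape prod(row_factors) x prod(col_factors)."""
    return prod(row_factors) * prod(col_factors)


def tt_param_count(row_factors, col_factors, ranks):
    """Parameters in a TT-matrix with cores of shape (r[k], m[k], n[k], r[k+1])."""
    d = len(row_factors)
    if len(col_factors) != d or len(ranks) != d + 1:
        raise ValueError("need d row factors, d column factors, and d + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * row_factors[k] * col_factors[k] * ranks[k + 1]
        for k in range(d)
    )


# Hypothetical example: a 1024 x 1024 layer, each side factored as 4*8*8*4,
# with all internal TT ranks set to 8 (illustrative values only).
rows = (4, 8, 8, 4)
cols = (4, 8, 8, 4)
ranks = (1, 8, 8, 8, 1)

dense = dense_param_count(rows, cols)
tt = tt_param_count(rows, cols, ranks)
print(dense, tt, round(dense / tt, 1))
```

With these illustrative numbers the TT form stores roughly two orders of magnitude fewer values than the dense matrix. The accuracy retained at a given rank depends on the model and is not captured by this count.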
---|
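
The bit-serial column MAC described for the third design can be illustrated with a small behavioral model. This is a minimal sketch under stated assumptions, not the thesis circuit: it assumes unsigned weights, processes one weight bit-plane per cycle starting from the least significant bit, and uses a hypothetical function name (`bit_serial_mac`). The summary does not specify which operand is serialized or how signed values are handled.

```python
def bit_serial_mac(activations, weights, weight_bits):
    """Dot product evaluated one weight bit-plane per cycle (unsigned weights).

    Behavioral model only: each loop iteration stands for one cycle in which
    every 1b cell contributes its activation if the stored bit is 1.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    if any(w < 0 or w >= (1 << weight_bits) for w in weights):
        raise ValueError("weights must fit in weight_bits unsigned bits")

    acc = 0
    for bit in range(weight_bits):  # least significant bit first
        # Sum the activations whose weight has this bit set.
        partial = sum(a for a, w in zip(activations, weights) if (w >> bit) & 1)
        # Scale the partial sum by the bit's binary significance.
        acc += partial << bit
    return acc


# Hypothetical operands for illustration.
acts = [1, 2, 3, 4]
wts = [5, 0, 3, 15]
result = bit_serial_mac(acts, wts, weight_bits=4)
reference = sum(a * w for a, w in zip(acts, wts))
print(result, reference)
```

In this model the number of cycles grows linearly with `weight_bits`, so higher weight precision costs proportionally more passes over the same 1b storage cells.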