Energy-efficient hardware accelerators based on bit-serial graph and memory-centric computing architectures

Bibliographic Details
Main Author: Mu, Junjie
Other Authors: Kim Tae Hyoung
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access: https://hdl.handle.net/10356/165577
Institution: Nanyang Technological University
Description
Summary: As semiconductor process technology nodes have shrunk over the past few decades, the complexity of application-specific integrated circuits (ASICs) has grown significantly. Emerging ASICs have been widely explored to accelerate various algorithms, including machine learning and scientific computing, with high energy efficiency. However, hardware based on the conventional von Neumann architecture, with separate memory and processing units, faces a critical performance bottleneck due to high memory access energy and high communication bandwidth requirements. Computing architectures that conform to the characteristics of the algorithms stand out as promising candidates for improving hardware performance.
This thesis focuses on the design of hardware accelerators based on memory-centric computing and bit-serial graph computing architectures. These architectures improve circuit performance by minimizing memory access and localizing computation. Memory-centric computing aims to reduce latency and minimize memory access by processing data where it resides. It is used for computing tasks in which a large amount of data is communicated between computing units and memory. Graph computing handles iterative problems that require neighboring values well by establishing a communication channel between adjacent cells.
The first half of this thesis demonstrates a novel hardware design for solving combinatorial optimization problems (COPs) with high energy efficiency. Combinatorial optimization has noteworthy applications in several fields, including artificial intelligence, software engineering, VLSI, and theoretical computer science. Conventional von Neumann computers incur high energy consumption and demand high computing power when solving NP-hard (non-deterministic polynomial-time hard) COPs, since the time needed to find solutions grows exponentially as the number of variables increases.
In Chapter 3, we present a hybrid analog-digital implementation of an annealing computer based on in-memory computing and the King’s graph model to solve COPs. In-memory computing addresses the von Neumann bottleneck by bringing the processing task into the memory array to minimize memory access. The King’s graph structure localizes computation and maximizes the number of neighbors per cell in a two-dimensional (2D) plane. The proposed annealing computer also realizes significant improvements in area and programmability compared to fully digital and fully analog implementations. The test chip fabricated in 65 nm CMOS demonstrates that the proposed accelerator consumes 9.9 mW at 0.8 V and 320 MHz.
Partial differential equations (PDEs) are widely used in physics and engineering, in areas such as heat conduction, fluid mechanics, electrodynamics, and quantum mechanics. Solving PDEs numerically is computationally expensive because it takes many iterations with heavy data communication. It also demands high precision to guarantee accurate solutions and convergence. To overcome these challenges, the second half of the thesis proposes two all-digital, bit-serial, graph-based accelerators with a 2D processing element (PE) array that solve PDEs using the finite difference method (FDM) and a checkerboard grid update method.
Chapter 4 presents a graph hardware accelerator for solving 2D PDEs using residue-based FDM that improves energy efficiency with dynamic precision control. The proposed accelerator enables massive parallelism using the checkerboard update method and minimizes data communication by mapping equations onto a 2D grid and localizing the computation. The bit-serial computing scheme reduces the area overhead and the communication bandwidth requirement. The accelerator, integrated in a silicon area of 0.462 mm², consumes 1.59 nJ per iteration at 16-bit precision, 1 V, and 25.6 MHz.
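As a rough illustration of the checkerboard grid update method mentioned above, the following Python sketch solves Laplace's equation on a small 2D grid with the standard 5-point finite-difference stencil. It is a software analogy only, not the thesis's design: it uses floating-point arithmetic rather than residue-based FDM, dynamic precision, or bit-serial hardware, and the function names, grid size, boundary values, and tolerance are all assumptions chosen for the example.

```python
def checkerboard_sweep(u, color):
    """Update every interior cell whose (i + j) parity equals `color`.

    Each cell is replaced by the average of its four neighbours (5-point
    stencil for Laplace's equation). Cells of one colour only read cells of
    the other colour, so all cells of a colour could be updated in parallel.
    Returns the largest change made during the sweep.
    """
    max_change = 0.0
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            if (i + j) % 2 != color:
                continue
            new = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
            max_change = max(max_change, abs(new - u[i][j]))
            u[i][j] = new
    return max_change


def solve_laplace(n=16, tol=1e-6, max_iters=10_000):
    """Iterate two-colour sweeps until the largest update falls below `tol`."""
    # Hypothetical test problem: top edge held at 1.0, all other edges at 0.0.
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    for iteration in range(1, max_iters + 1):
        change = max(checkerboard_sweep(u, 0), checkerboard_sweep(u, 1))
        if change < tol:
            return u, iteration
    return u, max_iters
```

Calling `solve_laplace(n=8)` returns the converged grid and the number of iterations used. Because each colour depends only on the other colour, a hardware array could in principle update half of its cells at once using only nearest-neighbour links, which is the locality property the summary describes.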
Motivated by the lack of reconfigurability and scalability in existing PDE hardware solvers, Chapter 5 proposes the first single chip to solve both 2D and 3D PDEs using a fully digital, graph-based architecture. In addition to bit-serial graph computing, near-memory computing is exploited to reduce the cost of data movement. The test chip consumes 5.2 pJ and 6.5 pJ per update for solving 2D and 3D PDEs, respectively, at 1 V and 25.6 MHz, with a core area of 0.811 mm².
In summary, three hardware accelerators based on memory-centric computing and bit-serial graph-based computing architectures are presented to efficiently perform computational tasks, including solving COPs and PDEs. The performance of the proposed accelerators is evaluated through measurement results from test chips fabricated in 65 nm CMOS technology.
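For readers unfamiliar with bit-serial computing, the sketch below models the general idea in Python: two operands are added one bit per step, least significant bit first, reusing a single full adder and a one-bit carry. This is a generic textbook illustration, not the datapath of the accelerators described above; the function name `bit_serial_add` and the `width` parameter are assumptions made for the example.

```python
def bit_serial_add(a, b, width=16):
    """Add two unsigned integers one bit per step, LSB first.

    Models one full adder plus a one-bit carry register reused for `width`
    steps. The result wraps modulo 2**width, like a fixed-width datapath.
    """
    carry = 0
    result = 0
    for k in range(width):
        bit_a = (a >> k) & 1
        bit_b = (b >> k) & 1
        sum_bit = bit_a ^ bit_b ^ carry                      # full-adder sum
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))  # full-adder carry
        result |= sum_bit << k
    return result
```

Only one bit of each operand is needed per step, so the adder and the links feeding it can be one bit wide. The number of steps equals `width`, so a lower precision takes proportionally fewer steps in this model.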