Energy efficient hardware accelerators based on memory-centric computing architecture

Bibliographic Details
Main Author: Yu, Chengshuo
Other Authors: Kim Tae Hyoung
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Engineering
Online Access:https://hdl.handle.net/10356/177433
Institution: Nanyang Technological University
id sg-ntu-dr.10356-177433
institution Nanyang Technological University
building NTU Library
collection DR-NTU
topic Engineering
description Application-specific integrated circuit (ASIC) design has gained immense popularity in recent years because it provides tailored solutions for specific applications. ASICs are designed to perform a specific set of functions or tasks and are optimized for high performance and low power consumption. They can be found in a wide range of products, such as smartphones, medical devices, automotive systems, and many other electronic devices. However, most recent ASIC designs based on the traditional von Neumann architecture, which separates the memory and processing units, suffer from a severe performance bottleneck. In the von Neumann architecture, both data and instructions are stored in the same memory and are fetched and executed sequentially: the processor must wait for data to be fetched from memory before it can execute instructions, and it must also wait for results to be stored back in memory. This creates a bottleneck because the processor is typically much faster than the rate at which data can be transferred to and from memory. The von Neumann bottleneck becomes especially apparent in complex computational tasks that require frequent data transfers between the processor and memory; as processors become faster and more powerful, overall system performance can be limited significantly by the speed of memory access. Therefore, computing architectures designed to match the characteristics of their target applications are regarded as promising candidates for enhancing hardware performance. This thesis delves into hardware accelerators, focusing on memory-centric computing approaches for processing neural networks and solving the shortest-path searching problem. By prioritizing efficient memory access and leveraging processing element (PE) array-based designs, significant improvements in circuit performance, encompassing energy efficiency and operating speed, are achieved. 
Within this context, the compute-in-memory (CIM) architecture emerges as a pivotal paradigm, gaining widespread adoption and displacing conventional von Neumann architectures in diverse applications. Its relevance extends to accelerating tasks such as artificial neural network processing, optimization problem-solving, and mathematical computation, especially in resource-constrained mobile edge computing environments. Chapter 3 presents an innovative 8T static random access memory (SRAM)-based CIM macro designed for energy-efficient neural network processing. The proposed 8T bitcell overcomes the disturb issues commonly encountered in conventional 6T bitcells by introducing two additional transistors that decouple the read path, ensuring reliable and accurate operation. To convert analog multiply-and-accumulate (MAC) results into a compact output code, a column analog-to-digital converter (ADC) built from 32× replica SRAM bitcells is employed; it performs the conversion by iteratively sweeping the reference levels across 1–31 cycles. To validate the proposed design and its energy-efficiency claims, a test chip incorporating a 16K 8T SRAM bitcell array is fabricated in a 65-nm process. Experimental measurements demonstrate energy efficiencies ranging from 490 to 15.8 trillion operations per second per watt (TOPS/W) for ADC resolutions of 1–5 bits, with core supply voltages of 0.45 V and 0.8 V. To further improve computation throughput and energy efficiency, Chapter 4 proposes a novel dual 7T SRAM-based CIM macro featuring a zero-skipping scheme and a binary-searching ADC. The ADC, composed of two groups of fixed-weight bitcells, converts analog dot-product results into a binary output code of 1 to 5 bits. 
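The reference-sweeping conversion described above can be sketched in software. The following Python model is a conceptual illustration only, not the circuit: it mimics a linear reference sweep in which an N-bit code can take up to 2^N − 1 comparison cycles, matching the 1–31 cycles quoted for 1–5-bit resolutions. The function and parameter names are illustrative, and the normalized `full_scale` range is an assumption.

```python
def ramp_quantize(analog_value, n_bits, full_scale=1.0):
    """Digitize a value by sweeping reference levels upward, one per cycle.

    Conceptual model: the actual macro compares a bitline voltage against
    references generated by 32x replica SRAM bitcells.
    """
    levels = (1 << n_bits) - 1           # 2^N - 1 reference steps
    step = full_scale / (1 << n_bits)    # spacing between references
    code = 0
    for cycle in range(1, levels + 1):   # up to 31 cycles at 5-bit resolution
        if analog_value >= cycle * step:
            code = cycle                 # value still above this reference
        else:
            break                        # first reference above the value
    return code, cycle

# Example: a 5-bit conversion of a mid-scale analog value
code, cycles_used = ramp_quantize(0.37, 5)
```

The cycle count grows exponentially with resolution, which is exactly the cost the binary-searching ADC of Chapter 4 is designed to avoid.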
This conversion is accomplished by a binary searching scheme that operates within 1 to 5 cycles. To validate the effectiveness of the proposed design, a 65-nm test chip featuring a 66-kilobit dual 7T SRAM bitcell array is fabricated and evaluated on neural network processing. Experimental measurements demonstrate energy efficiencies of 258.5/67.9/23.9 TOPS/W for weights with 3/7/15 levels, respectively, at supply voltages of 0.45 V and 0.8 V. Moreover, the thesis addresses the shortest-path searching problem, a fundamental challenge in graph theory with broad implications spanning autonomous vehicle navigation, VLSI routing, and robotic arm manipulation. By conceptualizing maps and potential routes as graphs, efficient path-searching algorithms can be applied to identify optimal routes, highlighting the practical significance of memory-centric computing paradigms in diverse computational tasks. Chapter 5 introduces a hardware accelerator that implements true time-domain wavefront computing within a highly parallel two-dimensional (2-D) PE array. The proposed 2-D time-domain PE array is designed for scalability and reconfigurability, enabling its application in diverse scenarios. While the King's graph model is primarily employed for the shortest-path searching problem, the PE array can be reconfigured to a simpler lattice graph model to solve other challenges such as maze solving, which is used as a benchmark in this thesis. 
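The advantage of a binary searching scheme over a linear reference sweep can be shown with a short sketch. This Python model is illustrative only, assuming a normalized full-scale range: each cycle resolves one output bit by comparing against a mid-point reference, so an N-bit code needs only N cycles, consistent with the 1–5 cycles quoted for 1–5-bit outputs.

```python
def binary_search_quantize(analog_value, n_bits, full_scale=1.0):
    """Digitize via successive approximation: one comparison per output bit.

    Conceptual model; the chip realizes the reference levels with two
    groups of fixed-weight bitcells rather than arithmetic.
    """
    code = 0
    for cycle in range(n_bits):                    # N cycles for N bits
        trial = code | (1 << (n_bits - 1 - cycle)) # tentatively set next bit
        reference = trial * full_scale / (1 << n_bits)
        if analog_value >= reference:              # keep the trial bit
            code = trial
    return code

# Same input as the ramp example resolves in 5 cycles instead of up to 31
code = binary_search_quantize(0.37, 5)
```

Both schemes produce the same output code; only the number of comparison cycles differs, which is where the throughput and energy savings come from.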
For a 2-D map comprising 32 × 32 vertices, the PE array consumes 776 pJ per task and achieves a search rate of 1.6 billion edges per second at core supply voltages of 1.2 V and 1.0 V. In summary, three hardware accelerators based on memory-centric computing are proposed to efficiently execute computational tasks, including neural network processing and shortest-path searching. Their performance is evaluated by measuring test chips fabricated in 65-nm CMOS technology.
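The wavefront principle behind the Chapter 5 accelerator can be modeled as a breadth-first expansion on a King's graph: a wave spreads one edge per time step from the start cell(s), and each cell latches its first arrival time as its shortest-path distance. The sketch below is a sequential software analogue under that assumption; the chip performs the expansion fully in parallel across the 32 × 32 PE array, and the function name and grid encoding are illustrative.

```python
from collections import deque

def wavefront_distances(width, height, sources, obstacles=frozenset()):
    """Arrival times of a unit-speed wavefront on a King's graph grid."""
    # King's graph: 8-connected moves, like a chess king.
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
             if (dx, dy) != (0, 0)]
    dist = {src: 0 for src in sources}      # wave starts at time 0
    frontier = deque(sources)
    while frontier:
        x, y = frontier.popleft()
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in obstacles and nxt not in dist):
                dist[nxt] = dist[(x, y)] + 1   # first arrival is shortest
                frontier.append(nxt)
    return dist

# Circular wavefront from one corner of a 32 x 32 map
dist = wavefront_distances(32, 32, [(0, 0)])
```

Passing multiple sources models the multi-start-point wavefront simulations, and an `obstacles` set turns the same routine into the maze-solving benchmark; restricting `moves` to 4-connected steps gives the simpler lattice-graph configuration.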
spelling sg-ntu-dr.10356-177433 2024-06-03T06:51:19Z Yu, Chengshuo Kim Tae Hyoung School of Electrical and Electronic Engineering THKIM@ntu.edu.sg Engineering Doctor of Philosophy 2024-05-26T23:23:06Z 2024 Thesis-Doctor of Philosophy Yu, C. (2024). Energy efficient hardware accelerators based on memory-centric computing architecture. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/177433 10.32657/10356/177433 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University