Modelling, exploration and optimization of hardware accelerators for deep learning applications

Bibliographic Details
Main Author: Dutt, Arko
Other Authors: Mohamed M. Sabry Aly
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/164987
Institution: Nanyang Technological University
Description
Summary: Current applications that require processing of large amounts of data, such as in healthcare, transportation, media, banking, telecom, the internet of things, and security, demand new computing systems with extreme performance and energy efficiency. Several advancements in general-purpose computing (such as general-purpose graphics processing units) and new custom hardware (such as the tensor processing unit) have been proposed to meet these performance needs. These computing systems are still bottlenecked by poor power efficiency due to excessive data transfers. Although many new computing architectures targeting efficient processing of application workloads are emerging in academia and industry, deciding on the most efficient hardware solution takes a considerable amount of time. Moreover, computer architects cannot guarantee which hardware delivers the best performance with the least energy consumption. An efficient computing system requires co-optimization at the device, circuit, architecture, and system levels. A toolchain or unified tool that can instantaneously and accurately simulate the hardware costs of an application workload can therefore accelerate the design and optimization of efficient computing systems well before their physical realization.

In this dissertation, we present two mechanisms to accelerate the estimation of the cost metrics of hardware accelerators targeting deep learning workloads, such as energy consumption, performance, and energy-delay product (EDP). Deep learning owes its popularity to its improved prediction accuracy, and it finds widespread use in large-scale data processing applications.

First, we used deep neural networks to speed up the estimation of the execution time, energy consumption, and area of neural network accelerators by 10^6× relative to baseline cycle-accurate simulations. We call this mechanism EAST-DNN (Expediting Architectural SimulaTions using Deep Neural Networks); it achieves high accuracy against the baseline. A major disadvantage of EAST-DNN is the time and cost needed to collect data for training the deep neural network, so an efficient technique that avoids costly training-data collection is essential.
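To make the idea concrete, below is a minimal sketch (not the thesis code) of an EAST-DNN-style surrogate: a small neural network trained on configuration-to-cost pairs so that, once trained, it replaces slow cycle-accurate simulation with near-instant predictions. The design parameters, the toy cost function standing in for the simulator, and the use of scikit-learn are all illustrative assumptions.

```python
# Hypothetical sketch of a neural-network surrogate for hardware cost estimation.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def toy_cycle_accurate_sim(cfg):
    """Stand-in for an expensive cycle-accurate simulator (all costs invented).
    cfg = (num_pes, buffer_kb, bandwidth_gbps) -> (latency, energy, area)."""
    pes, buf, bw = cfg
    latency = 1e6 / (pes * min(1.0, bw / 10.0))   # compute- or bandwidth-bound
    energy = 0.5 * pes + 0.05 * buf + 2e3 / bw    # toy energy terms
    area = 0.1 * pes + 0.02 * buf                 # toy area terms
    return latency, energy, area

# Collecting this config -> cost training set is the costly step EAST-DNN pays once.
X = rng.uniform([16, 64, 1], [1024, 4096, 64], size=(2000, 3))
y = np.array([toy_cycle_accurate_sim(cfg) for cfg in X])

surrogate = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
)
surrogate.fit(X, y)  # after training, predictions replace full simulations

print(surrogate.predict([[256, 512, 16]]))  # near-instant cost estimate
```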
Second, we formulated closed-form analytical representations to further accelerate the estimation of the hardware costs of deep learning accelerators without the overhead of DNN training, while achieving accuracy comparable to or better than EAST-DNN. We call it Pearl, a name that represents the approach: towards optimization of DNN accelerators using closed-form analytical representations. In addition to high accuracy and speedup compared to the state of the art, Pearl provides the flexibility to explore many parameters of a particular accelerator architecture template, and it can be extended to model several other architecture templates and dataflow mappings for efficient deep learning acceleration. The Pearl formulation is, in general, independent of device technology and accelerator architecture.

Third, as an application of Pearl in the search for efficient (and emerging) deep learning systems, we used this faster, more accurate analytical-model-based simulator to explore and formulate optimization problems. We presented several case studies that use the analytical models for DNN-accelerator optimization under user-defined constraints, including area-constrained and energy-delay Pareto optimization of DNN accelerators; a sketch of this exploration style appears after this summary. These case studies show how memory and compute resources can be chosen efficiently through accelerated Pearl-based simulations. The heavy memory traffic imposed by large DNN workloads increases overall energy consumption, and these exploration methods uncover cases where an emerging memory improves the energy consumption of a DNN-accelerator configuration.

Fourth, we built an emerging memory from resistive random-access memory (RRAM) and thin-film transistor (TFT) devices, acting as off-chip main memory for an emerging multi-tier monolithic 3D system design. We used Pearl-based simulations to quantify the system-level benefits of an emerging computing system composed of these newer devices, with dataflow accelerators supported by Pearl, targeting deep learning inference.

The material presented in this thesis paves the way for ultra-scale exploration and optimization of domain-specific accelerators for deep learning inference in a short time. Using these approaches, the quality of any new hardware accelerator (new device technology, architecture, memory, or dataflow mapping) can be quickly evaluated and optimized for a specific deep learning application.
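As a concrete illustration of the exploration style referenced in the case studies above, here is a hedged sketch of how a closed-form analytical cost model enables exhaustive design-space sweeps with area-constrained energy-delay Pareto filtering. It is not Pearl's actual formulation; every constant, parameter name, and cost term is an illustrative assumption.

```python
# Hypothetical sketch of closed-form cost modelling plus Pareto exploration.
from itertools import product

E_MAC, E_SRAM, E_DRAM = 1.0, 5.0, 200.0   # toy energy per op/access (pJ)

def analytical_costs(num_pes, buffer_kb, macs=1e9, dram_accesses=1e7):
    """Closed-form estimates: no simulation loop, so evaluation is instant."""
    latency = macs / num_pes                   # cycles, compute-bound toy model
    sram_accesses = macs / max(1, buffer_kb)   # larger buffers -> more reuse
    energy = macs * E_MAC + sram_accesses * E_SRAM + dram_accesses * E_DRAM
    area = 0.1 * num_pes + 0.02 * buffer_kb    # toy area model (mm^2)
    return latency, energy, area

AREA_BUDGET = 60.0
space = list(product([64, 128, 256, 512, 1024],        # num_pes candidates
                     [128, 256, 512, 1024, 2048]))     # buffer_kb candidates

# Evaluate every design point analytically; keep only area-feasible ones.
points = []
for p, b in space:
    lat, en, ar = analytical_costs(p, b)
    if ar <= AREA_BUDGET:
        points.append(((p, b), lat, en))

# Energy-delay Pareto front: keep points not dominated in both latency and energy.
pareto = [pt for pt in points
          if not any(o[1] <= pt[1] and o[2] <= pt[2] and o != pt for o in points)]
for cfg, lat, en in sorted(pareto, key=lambda x: x[1]):
    print(f"config={cfg}  latency={lat:.2e}  energy={en:.2e}  EDP={lat * en:.2e}")
```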