FPGA acceleration of continual learning at the edge
Edge AI systems are increasingly being adopted in a wide range of application domains. These systems typically deploy Convolutional Neural Network (CNN) models on edge devices to perform inference, while relying on the cloud for model training. This is due to the high computational and memory demand...
Main Author: | Piyasena Gane Pathirannahelage Duvindu |
---|---|
Other Authors: | Lam Siew Kei |
Format: | Thesis-Master by Research |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering::Hardware; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/153778 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-153778
---|---
record_format | dspace
institution | Nanyang Technological University
building | NTU Library
continent | Asia
country | Singapore
content_provider | NTU Library
collection | DR-NTU
language | English
topic | Engineering::Computer science and engineering::Hardware; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description |
Edge AI systems are increasingly being adopted in a wide range of application domains. These systems typically deploy Convolutional Neural Network (CNN) models on edge devices to perform inference, while relying on the cloud for model training. This is due to the high computational and memory demands of conventional model training, which exceed the capabilities of resource-constrained edge devices running on tight power budgets. The dependency on the cloud for training is unsuitable in many applications where new objects, or environmental conditions that differ from those present during training, are frequently encountered. In such applications, continual learning of new knowledge on the edge device becomes a necessity to avoid performance bottlenecks due to round-trip communication delays, network connectivity, and the available bandwidth.
In this thesis, we propose Field-Programmable Gate Array (FPGA) based accelerator architectures and optimization strategies for a new paradigm of machine learning algorithms that are capable of continual learning. The proposed methods enable edge FPGA systems to perform on-device deep continual learning for object classification. Specifically, they aim to achieve real-time on-device learning while providing a high degree of scalability to learn a large number of classes.
We first propose an FPGA accelerator for a Self-Organizing Neural Network (SONN) that can perform class-incremental continual learning in a streaming manner when combined with a CNN. The SONN model performs unsupervised learning from embedding features extracted by the CNN, dynamically growing neurons and connections. We introduce design optimization strategies and runtime scheduling techniques to optimize resource usage, latency, and energy consumption. Experimental results on the CORe50 dataset for continuous object recognition from video sequences demonstrate that the proposed FPGA architecture outperforms CPU- and GPU-based counterparts in terms of latency and power. However, the SONN model grows in proportion to the number of classes learnt, which limits its scalability to learn a large number of classes efficiently.
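The growing behaviour described above can be illustrated with a minimal software sketch. This is not the thesis's SONN architecture (which is a hardware design); it is a hypothetical growing-prototype model in the spirit of SOINN/GWR networks, where a new neuron is inserted whenever an input embedding is farther than a distance threshold from its best-matching unit, and the winner is otherwise nudged toward the input. The class name `GrowingSONN` and all parameters are illustrative assumptions.

```python
import numpy as np

class GrowingSONN:
    """Hypothetical sketch of a self-organizing network that grows
    prototype neurons for streaming class-incremental learning.
    A new neuron is inserted when the input is farther than
    `threshold` from its best-matching unit; otherwise the winner
    is nudged toward the input by learning rate `lr`."""

    def __init__(self, threshold=1.0, lr=0.1):
        self.threshold, self.lr = threshold, lr
        self.protos = []  # list of (prototype vector, class label)

    def fit_one(self, x, y=None):
        if not self.protos:
            self.protos.append((x.copy(), y))
            return
        dists = [np.linalg.norm(x - p) for p, _ in self.protos]
        i = int(np.argmin(dists))
        if dists[i] > self.threshold:
            self.protos.append((x.copy(), y))  # grow a new neuron
        else:
            p, lbl = self.protos[i]            # adapt the winner in place
            self.protos[i] = (p + self.lr * (x - p), lbl if lbl is not None else y)

    def predict(self, x):
        i = int(np.argmin([np.linalg.norm(x - p) for p, _ in self.protos]))
        return self.protos[i][1]
```

Note how memory footprint tracks the number of prototypes, which is why a model of this kind grows with the classes learnt, as the paragraph above observes.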
Next, we propose an FPGA accelerator for a Streaming Linear Discriminant Analysis (SLDA) model to overcome the scalability limitations of SONN. Like SONN, SLDA performs continual learning in a streaming manner from embedding features extracted by a CNN. SLDA is highly scalable for learning a large number of classes because the network does not grow dynamically to accommodate new knowledge. We propose several design and runtime optimizations to minimize resource usage, latency, and energy consumption. Additionally, we introduce a new variant of SLDA and discuss its accuracy-efficiency trade-offs on two popular continual learning datasets, CORe50 and CUB200. The results demonstrate that the proposed SLDA accelerator outperforms CPU and GPU counterparts in terms of latency and energy efficiency.
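To see why SLDA has a fixed memory footprint regardless of how many samples arrive, consider a minimal software sketch of streaming LDA in the style of deep SLDA (Hayes and Kanan): per-class running means plus one shared covariance matrix are updated one sample at a time, with no gradients or replay buffer. This is an illustrative reference model, not the thesis's accelerator or its specific SLDA variant; the class name, the shrinkage parameter `eps`, and the update order are assumptions.

```python
import numpy as np

class SLDA:
    """Minimal streaming LDA sketch: per-class running means and a
    shared covariance, updated one (feature, label) pair at a time."""

    def __init__(self, dim, eps=1e-4):
        self.dim, self.eps = dim, eps
        self.means = {}    # class id -> running mean vector
        self.counts = {}   # class id -> samples seen for that class
        self.cov = np.zeros((dim, dim))
        self.total = 0     # samples seen overall

    def fit_one(self, x, y):
        if y not in self.means:
            self.means[y] = np.zeros(self.dim)
            self.counts[y] = 0
        d = x - self.means[y]  # deviation from the pre-update class mean
        # streaming update of the shared (within-class) covariance
        self.cov = (self.total * self.cov + np.outer(d, d)) / (self.total + 1)
        self.total += 1
        self.counts[y] += 1
        self.means[y] += d / self.counts[y]  # running-mean update

    def predict(self, x):
        # shrinkage-regularized precision matrix, recomputed at test time
        lam = np.linalg.inv((1 - self.eps) * self.cov + self.eps * np.eye(self.dim))
        best, best_score = None, -np.inf
        for y, mu in self.means.items():
            w = lam @ mu
            score = w @ x - 0.5 * mu @ w  # linear discriminant score
            if score > best_score:
                best, best_score = y, score
        return best
```

Because only the means and one covariance are stored, adding a class adds just one mean vector, which is the scalability property the paragraph above highlights.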
Finally, we demonstrate a full on-chip deep continual learning pipeline on an FPGA by integrating the proposed SLDA accelerator with the Xilinx DPU, a programmable CNN accelerator IP. The design is implemented on a Xilinx Zynq Ultrascale+ MPSoC. To overcome the performance bottleneck caused by communication overhead between the ARM processing system (PS) and the programmable logic (PL), we implemented a Linux device driver that facilitates efficient memory mapping between the PS and PL. Experimental results on the CORe50 dataset show that the proposed pipeline performs continual learning at nearly the same latency as the inference pipeline, with only a marginal increase in energy consumption. Our results clearly demonstrate the viability of deploying real-time deep continual learning on edge AI systems equipped with FPGA accelerators. |
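The memory-mapping idea behind the PS-PL driver can be sketched in miniature. On the actual MPSoC, a device driver would expose a physically contiguous buffer as a device node that user space `mmap()`s, so feature maps pass between the PS and the PL accelerator without intermediate copies. The sketch below only illustrates the shared-mapping mechanism: an ordinary temporary file stands in for the device node, and the "PS" and "PL" roles are simulated comments, not the thesis's driver.

```python
import mmap
import os
import tempfile

# A temp file stands in for the driver's device node; on the real system
# the mapped region would be a physically contiguous DMA buffer.
BUF_SIZE = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, BUF_SIZE)
with mmap.mmap(fd, BUF_SIZE) as shared:
    shared[0:4] = b"\xde\xad\xbe\xef"   # "PS side" writes a feature word in place
    word = bytes(shared[0:4])           # "PL side" sees the same bytes, no copy
os.close(fd)
os.unlink(path)
```

Writes land directly in the shared region, which is why eliminating the intermediate copy removes the PS-PL bottleneck the paragraph above describes.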
author2 | Lam Siew Kei
author_facet | Lam Siew Kei; Piyasena Gane Pathirannahelage Duvindu
format | Thesis-Master by Research
author | Piyasena Gane Pathirannahelage Duvindu
author_sort | Piyasena Gane Pathirannahelage Duvindu
title | FPGA acceleration of continual learning at the edge
title_sort | fpga acceleration of continual learning at the edge
publisher | Nanyang Technological University
publishDate | 2021
url | https://hdl.handle.net/10356/153778
_version_ | 1722355342456651776
spelling |
sg-ntu-dr.10356-153778 2022-01-05T09:23:41Z
Title: FPGA acceleration of continual learning at the edge
Author: Piyasena Gane Pathirannahelage Duvindu
Supervisor: Lam Siew Kei (ASSKLam@ntu.edu.sg), School of Computer Science and Engineering, Hardware & Embedded Systems Lab (HESL)
Subjects: Engineering::Computer science and engineering::Hardware; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Abstract: as given in the description field above.
Degree: Master of Engineering
Dates: 2021-12-10T04:32:31Z (accessioned; available); 2021 (issued)
Format: Thesis-Master by Research
Citation: Piyasena Gane Pathirannahelage Duvindu (2021). FPGA acceleration of continual learning at the edge. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153778
DOI: 10.32657/10356/153778
Language: en
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File format: application/pdf
Publisher: Nanyang Technological University |