FPGA acceleration of continual learning at the edge
Edge AI systems are increasingly being adopted in a wide range of application domains. These systems typically deploy Convolutional Neural Network (CNN) models on edge devices to perform inference, while relying on the cloud for model training. This is due to the high computational and memory demand...
Main Author: | Piyasena Gane Pathirannahelage Duvindu |
---|---|
Other Authors: | Lam Siew Kei |
Format: | Thesis-Master by Research |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering::Hardware; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/153778 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-153778
---|---
record_format | dspace
institution | Nanyang Technological University
building | NTU Library
continent | Asia
country | Singapore
content_provider | NTU Library
collection | DR-NTU
language | English
topic | Engineering::Computer science and engineering::Hardware; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description |
Edge AI systems are increasingly being adopted in a wide range of application domains. These systems typically deploy Convolutional Neural Network (CNN) models on edge devices to perform inference, while relying on the cloud for model training. This is due to the high computational and memory demands of conventional model training, which exceed the capabilities of resource-constrained edge devices running on tight power budgets. The dependency on the cloud for training is unsuitable in many applications where new objects, or environmental conditions that differ from those present during training, are frequently encountered. In such applications, continual learning of new knowledge on the edge device becomes a necessity to avoid performance bottlenecks due to round-trip communication delays, network connectivity, and the available bandwidth.
In this thesis, we propose Field-Programmable Gate Array (FPGA) based accelerator architectures and optimization strategies for a new paradigm of machine learning algorithms that are capable of continual learning. The proposed methods enable edge FPGA systems to perform on-device deep continual learning for object classification. Specifically, they aim to achieve real-time on-device learning while providing a high degree of scalability to learn a large number of classes.
We first propose an FPGA accelerator for a Self-Organizing Neural Network (SONN) that can perform class-incremental continual learning in a streaming manner when combined with a CNN. The SONN model performs unsupervised learning from embedding features extracted by the CNN, dynamically growing neurons and connections. We introduce design optimization strategies and runtime scheduling techniques to optimize resource usage, latency, and energy consumption. Experimental results on the CORe50 dataset for continuous object recognition from video sequences demonstrate that the proposed FPGA architecture outperforms CPU- and GPU-based counterparts in terms of latency and power. However, the SONN model grows in proportion to the number of classes learnt, which limits its scalability to learn a large number of classes efficiently.
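The growing behaviour described above can be illustrated with a minimal software sketch. This is not the thesis's SONN architecture (which is a hardware design); it is a hypothetical growing-prototype model in the spirit of SOINN/GWR networks, where a new neuron is inserted whenever an input embedding is farther than a distance threshold from its best-matching unit, and the winner is otherwise nudged toward the input. The class name `GrowingSONN` and all parameters are illustrative assumptions.

```python
import numpy as np

class GrowingSONN:
    """Hypothetical sketch of a self-organizing network that grows
    prototype neurons for streaming class-incremental learning.
    A new neuron is inserted when the input is farther than
    `threshold` from its best-matching unit; otherwise the winner
    is nudged toward the input by learning rate `lr`."""

    def __init__(self, threshold=1.0, lr=0.1):
        self.threshold, self.lr = threshold, lr
        self.protos = []  # list of (prototype vector, class label)

    def fit_one(self, x, y=None):
        if not self.protos:
            self.protos.append((x.copy(), y))
            return
        dists = [np.linalg.norm(x - p) for p, _ in self.protos]
        i = int(np.argmin(dists))
        if dists[i] > self.threshold:
            self.protos.append((x.copy(), y))  # grow a new neuron
        else:
            p, lbl = self.protos[i]            # adapt the winner in place
            self.protos[i] = (p + self.lr * (x - p), lbl if lbl is not None else y)

    def predict(self, x):
        i = int(np.argmin([np.linalg.norm(x - p) for p, _ in self.protos]))
        return self.protos[i][1]
```

Note how memory footprint tracks the number of prototypes, which is why a model of this kind grows with the classes learnt, as the paragraph above observes.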
Next, we propose an FPGA accelerator for a Streaming Linear Discriminant Analysis (SLDA) model to overcome the scalability limitations of SONN. Like SONN, SLDA performs continual learning in a streaming manner from embedding features extracted by a CNN. SLDA is highly scalable for learning a large number of classes because the network does not grow dynamically to accommodate new knowledge. We propose several design and runtime optimizations to minimize resource usage, latency, and energy consumption. Additionally, we introduce a new variant of SLDA and discuss its accuracy-efficiency trade-offs on two popular continual learning datasets, CORe50 and CUB200. The results demonstrate that the proposed SLDA accelerator outperforms CPU and GPU counterparts in terms of latency and energy efficiency.
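To see why SLDA has a fixed memory footprint regardless of how many samples arrive, consider a minimal software sketch of streaming LDA in the style of deep SLDA (Hayes and Kanan): per-class running means plus one shared covariance matrix are updated one sample at a time, with no gradients or replay buffer. This is an illustrative reference model, not the thesis's accelerator or its specific SLDA variant; the class name, the shrinkage parameter `eps`, and the update order are assumptions.

```python
import numpy as np

class SLDA:
    """Minimal streaming LDA sketch: per-class running means and a
    shared covariance, updated one (feature, label) pair at a time."""

    def __init__(self, dim, eps=1e-4):
        self.dim, self.eps = dim, eps
        self.means = {}    # class id -> running mean vector
        self.counts = {}   # class id -> samples seen for that class
        self.cov = np.zeros((dim, dim))
        self.total = 0     # samples seen overall

    def fit_one(self, x, y):
        if y not in self.means:
            self.means[y] = np.zeros(self.dim)
            self.counts[y] = 0
        d = x - self.means[y]  # deviation from the pre-update class mean
        # streaming update of the shared (within-class) covariance
        self.cov = (self.total * self.cov + np.outer(d, d)) / (self.total + 1)
        self.total += 1
        self.counts[y] += 1
        self.means[y] += d / self.counts[y]  # running-mean update

    def predict(self, x):
        # shrinkage-regularized precision matrix, recomputed at test time
        lam = np.linalg.inv((1 - self.eps) * self.cov + self.eps * np.eye(self.dim))
        best, best_score = None, -np.inf
        for y, mu in self.means.items():
            w = lam @ mu
            score = w @ x - 0.5 * mu @ w  # linear discriminant score
            if score > best_score:
                best, best_score = y, score
        return best
```

Because only the means and one covariance are stored, adding a class adds just one mean vector, which is the scalability property the paragraph above highlights.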
Finally, we demonstrate a full on-chip deep continual learning pipeline on an FPGA by integrating the proposed SLDA accelerator with the Xilinx DPU, a programmable CNN accelerator IP. The design is implemented on a Xilinx Zynq Ultrascale+ MPSoC. To overcome the performance bottleneck caused by communication overhead between the ARM processing system (PS) and the programmable logic (PL), we implemented a Linux device driver that facilitates efficient memory mapping between the PS and PL. Experimental results on the CORe50 dataset show that the proposed pipeline performs continual learning at nearly the same latency as the inference pipeline, with only a marginal increase in energy consumption. Our results clearly demonstrate the viability of deploying real-time deep continual learning on edge AI systems equipped with FPGA accelerators. |
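The memory-mapping idea behind the PS-PL driver can be sketched in miniature. On the actual MPSoC, a device driver would expose a physically contiguous buffer as a device node that user space `mmap()`s, so feature maps pass between the PS and the PL accelerator without intermediate copies. The sketch below only illustrates the shared-mapping mechanism: an ordinary temporary file stands in for the device node, and the "PS" and "PL" roles are simulated comments, not the thesis's driver.

```python
import mmap
import os
import tempfile

# A temp file stands in for the driver's device node; on the real system
# the mapped region would be a physically contiguous DMA buffer.
BUF_SIZE = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, BUF_SIZE)
with mmap.mmap(fd, BUF_SIZE) as shared:
    shared[0:4] = b"\xde\xad\xbe\xef"   # "PS side" writes a feature word in place
    word = bytes(shared[0:4])           # "PL side" sees the same bytes, no copy
os.close(fd)
os.unlink(path)
```

Writes land directly in the shared region, which is why eliminating the intermediate copy removes the PS-PL bottleneck the paragraph above describes.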
author2 | Lam Siew Kei
author_facet | Lam Siew Kei; Piyasena Gane Pathirannahelage Duvindu
format | Thesis-Master by Research
author | Piyasena Gane Pathirannahelage Duvindu
author_sort | Piyasena Gane Pathirannahelage Duvindu
title | FPGA acceleration of continual learning at the edge
title_sort | fpga acceleration of continual learning at the edge
publisher | Nanyang Technological University
publishDate | 2021
url | https://hdl.handle.net/10356/153778
_version_ | 1722355342456651776
spelling |
sg-ntu-dr.10356-153778 2022-01-05T09:23:41Z
Title: FPGA acceleration of continual learning at the edge
Author: Piyasena Gane Pathirannahelage Duvindu
Supervisor: Lam Siew Kei (ASSKLam@ntu.edu.sg), School of Computer Science and Engineering, Hardware & Embedded Systems Lab (HESL)
Subjects: Engineering::Computer science and engineering::Hardware; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Abstract: as given in the description field above.
Degree: Master of Engineering
Dates: 2021-12-10T04:32:31Z (accessioned; available); 2021 (issued)
Format: Thesis-Master by Research
Citation: Piyasena Gane Pathirannahelage Duvindu (2021). FPGA acceleration of continual learning at the edge. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153778
DOI: 10.32657/10356/153778
Language: en
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File format: application/pdf
Publisher: Nanyang Technological University |