FPGA acceleration of continual learning at the edge


Bibliographic Details
Main Author: Piyasena Gane Pathirannahelage Duvindu
Other Authors: Lam Siew Kei
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/153778
Institution: Nanyang Technological University
id sg-ntu-dr.10356-153778
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Hardware
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description Edge AI systems are increasingly adopted across a wide range of application domains. These systems typically deploy Convolutional Neural Network (CNN) models on edge devices to perform inference while relying on the cloud for model training, because conventional model training has high computational and memory demands that exceed the capabilities of resource-constrained edge devices running on tight power budgets. This dependency on the cloud for training is unsuitable in many applications where new objects or environmental conditions, different from those present during training, are frequently encountered. In such applications, continual learning of new knowledge on the edge device becomes a necessity to avoid performance bottlenecks due to round-trip communication delays, network connectivity, and available bandwidth. In this thesis, we propose a Field-Programmable Gate Array (FPGA) based accelerator architecture and optimization strategies for a class of machine learning algorithms capable of continual learning. The proposed methods enable edge FPGA systems to perform on-device deep continual learning for object classification, aiming to achieve real-time learning on-device while scaling to a large number of classes. We first propose an FPGA accelerator for a Self-Organizing Neural Network (SONN) that can perform class-incremental continual learning in a streaming manner when combined with a CNN. The SONN model performs unsupervised learning from embedding features extracted by the CNN model, dynamically growing neurons and connections. We introduce design optimization strategies and runtime scheduling techniques to optimize resource usage, latency, and energy consumption.
Experimental results on the CORe50 dataset for continuous object recognition from video sequences demonstrate that the proposed FPGA architecture outperforms CPU- and GPU-based counterparts in terms of latency and power. However, the SONN model grows in proportion to the number of classes learnt, which limits its scalability to learn a large number of classes efficiently. Next, we propose an FPGA accelerator for a Streaming Linear Discriminant Analysis (SLDA) model to overcome the scalability limitations of SONN. Like SONN, SLDA performs continual learning from embedding features extracted from a CNN in a streaming manner. SLDA is highly scalable to a large number of classes because the network does not grow dynamically to accommodate new knowledge. We propose several design and runtime optimizations to minimize resource usage, latency, and energy consumption. Additionally, we introduce a new variant of SLDA and discuss its accuracy-efficiency trade-offs on popular continual learning datasets, CORe50 and CUB200. The results demonstrate that the proposed SLDA accelerator outperforms CPU and GPU counterparts in terms of latency and energy efficiency. Finally, we demonstrate a full on-chip deep continual learning pipeline on FPGA by integrating the proposed SLDA accelerator with the Xilinx DPU, a programmable CNN accelerator IP. The design is implemented on a Xilinx Zynq UltraScale+ MPSoC. To overcome the large performance bottleneck caused by the communication overhead between the ARM processing system (PS) and the programmable logic (PL), we implemented a Linux device driver that facilitates efficient memory mapping between the PS and PL. The experimental results on the CORe50 dataset show that the proposed pipeline performs continual learning at nearly the same latency as the inference pipeline, with only a marginal increase in energy consumption.
Our results demonstrate the viability of deploying real-time deep continual learning on edge AI systems equipped with FPGA accelerators.
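The streaming property that makes SLDA scalable — per-class running means plus one shared covariance, updated one sample at a time — can be illustrated with a short sketch. The following NumPy implementation of streaming LDA over fixed CNN embeddings is ours for illustration only; the class and parameter names are hypothetical and this is not the thesis's FPGA design or its exact update rule:

```python
import numpy as np

class StreamingLDA:
    """Illustrative streaming LDA classifier over fixed embedding vectors.

    Per-class running means and a shared running covariance are updated
    one sample at a time, so learning a new class only adds one mean
    vector -- the fixed-size property that makes SLDA scalable.
    """

    def __init__(self, dim, num_classes, shrinkage=1e-4):
        self.mu = np.zeros((num_classes, dim))   # per-class running means
        self.counts = np.zeros(num_classes)      # per-class sample counts
        self.sigma = np.zeros((dim, dim))        # shared running covariance
        self.n = 0                               # total samples seen
        self.shrinkage = shrinkage               # ridge term for inversion

    def fit_one(self, x, y):
        """Update statistics with one embedding x of integer class y."""
        if self.n > 0:
            # Rank-1 streaming covariance update using the deviation
            # of x from the current mean of its class.
            d = x - self.mu[y]
            delta = np.outer(d, d) * self.n / (self.n + 1)
            self.sigma = (self.n * self.sigma + delta) / (self.n + 1)
        # Running-mean update for class y.
        self.mu[y] = (self.counts[y] * self.mu[y] + x) / (self.counts[y] + 1)
        self.counts[y] += 1
        self.n += 1

    def predict(self, x):
        """Classify x with the linear discriminant induced by the stats."""
        dim = self.mu.shape[1]
        # Shrinkage keeps the shared covariance invertible.
        lam = np.linalg.inv(self.sigma + self.shrinkage * np.eye(dim))
        w = self.mu @ lam                             # class weight vectors
        b = -0.5 * np.einsum("kd,kd->k", w, self.mu)  # class biases
        return int(np.argmax(w @ x + b))
```

Because `fit_one` touches only one mean vector and performs a rank-1 covariance update, both the model size and the per-sample work stay fixed as classes accumulate, which is the property the thesis exploits for hardware acceleration.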
author2 Lam Siew Kei
format Thesis-Master by Research
author Piyasena Gane Pathirannahelage Duvindu
title FPGA acceleration of continual learning at the edge
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/153778
_version_ 1722355342456651776
spelling sg-ntu-dr.10356-153778 2022-01-05T09:23:41Z FPGA acceleration of continual learning at the edge Piyasena Gane Pathirannahelage Duvindu Lam Siew Kei School of Computer Science and Engineering Hardware & Embedded Systems Lab (HESL) ASSKLam@ntu.edu.sg Engineering::Computer science and engineering::Hardware Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Master of Engineering 2021-12-10T04:32:31Z 2021-12-10T04:32:31Z 2021 Thesis-Master by Research Piyasena Gane Pathirannahelage Duvindu (2021). FPGA acceleration of continual learning at the edge. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153778 10.32657/10356/153778 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University