FPGA implementation of low-power real-time convolutional neural network inference

Bibliographic Details
Main Author: Gerlinghoff, Daniel
Other Authors: Zheng Yuanjin
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/137750
Institution: Nanyang Technological University
Description
Summary: While artificial intelligence is applied in many areas of life, its computational intensity demands a large amount of computing resources. The data to be processed by these algorithms, however, are not generated in data centres or on desktop workstations. Instead, they originate from mobile devices and sensor networks, which are highly constrained in terms of hardware resources and power. To close this gap, this work presents an implementation of a convolutional neural network intended for deployment on low-power, low-cost FPGA devices. Such devices are typically used in IoT applications that acquire large amounts of data, yet their logic and memory resources are scarce. The implementation therefore optimizes the execution of the convolution operation for scalability: by adjusting only a few design parameters, deployment is possible on both low-power and high-performance devices. This is achieved by separating data storage from data processing. The implementation further features careful planning of data movement within the device to minimize power consumption and logic utilization, employing three different types of memory for data caching. Data values are stored with 8-bit resolution, which reduces classification accuracy by around 0.5%. The design was tested on an Altera Cyclone V device and achieved a performance of around 420 million operations per second at a clock frequency of 100 MHz. Relative to power, the design runs at around 0.35 GOPS/W, which is lower than previous implementations. In terms of absolute power consumption, however, it is superior, as the complete functionality can be enabled with only around 1 W.
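As a minimal illustrative sketch of what storing values with 8-bit resolution can look like, the snippet below applies symmetric per-tensor linear quantization to a convolution kernel; this is an assumption for illustration only, not the thesis's actual fixed-point scheme.

```python
# Illustrative sketch of symmetric 8-bit linear quantization.
# Assumed scheme for demonstration; the thesis may use a different fixed-point format.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float values to int8 using a single per-tensor scale factor."""
    scale = np.max(np.abs(x)) / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float values."""
    return q.astype(np.float32) * scale

# Example: the quantization error on a random 3x3 convolution kernel stays small,
# which is consistent with only a modest drop in classification accuracy.
w = np.random.randn(3, 3, 16, 32).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(dequantize(q, s) - w)))
```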