FPGA implementation of low-power real-time convolutional neural network inference
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Master by Coursework |
Language: | English |
Published: | Nanyang Technological University, 2020 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/137750 |
Institution: | Nanyang Technological University |
Summary:

While artificial intelligence is applied in many areas of life, its computational intensity requires a large amount of computing resources. The data meant to be processed with these algorithms, however, are not generated in data centres or on desktop workstations. Instead, they originate from mobile devices and sensor networks, which are highly constrained in terms of hardware resources and power.
To close this gap, this work presents an implementation of a convolutional neural network intended for deployment on low-power, low-cost FPGA devices. Such devices are typically used in IoT applications that involve the acquisition of large amounts of data, yet their logic and memory resources are scarce. This implementation therefore optimizes the execution of the convolution operation for scalability: by adjusting only a few parameters in the design, deployment is possible on both low-power and high-performance devices. This is made possible by separating data storage from data processing. The implementation further features careful planning of data movement in the device to minimize power consumption and logic utilization. Three different types of memory are employed for caching data. Data values are stored with an 8-bit resolution, which leads to a drop in classification accuracy of around 0.5 %.
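The summary does not spell out the exact quantization scheme used in the thesis. As a minimal illustrative sketch only, the following Python snippet shows one common way 8-bit storage can be combined with a convolution: symmetric per-tensor scaling of weights and activations, an integer multiply-accumulate with a 32-bit accumulator (as an FPGA datapath would keep before rescaling), and a final rescale back to floating point. All function names and the scaling choice are assumptions for illustration, not taken from the thesis.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor 8-bit quantization (illustrative, not the thesis's exact scheme)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def conv2d_int8(act_q, w_q, act_scale, w_scale):
    """Valid 2-D convolution on int8 data with a 32-bit accumulator, rescaled to float."""
    out_h = act_q.shape[0] - w_q.shape[0] + 1
    out_w = act_q.shape[1] - w_q.shape[1] + 1
    acc = np.zeros((out_h, out_w), dtype=np.int32)
    for i in range(out_h):
        for j in range(out_w):
            window = act_q[i:i + w_q.shape[0], j:j + w_q.shape[1]].astype(np.int32)
            acc[i, j] = np.sum(window * w_q.astype(np.int32))
    return acc.astype(np.float32) * act_scale * w_scale

# Compare the quantized result against random float data.
rng = np.random.default_rng(0)
act = rng.standard_normal((8, 8)).astype(np.float32)
w = rng.standard_normal((3, 3)).astype(np.float32)
act_q, s_a = quantize_int8(act)
w_q, s_w = quantize_int8(w)
approx = conv2d_int8(act_q, w_q, s_a, s_w)
```

With 8-bit weights and activations of this kind, the integer result typically tracks a float reference closely, which is in line with the small accuracy drop reported above.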
The design was tested on an Altera Cyclone V device and achieved a performance of around 420 million operations per second at a clock frequency of 100 MHz. Relative to power, the design runs at around 0.35 GOPS/W, which is lower than previous implementations. In terms of absolute power consumption, however, it is superior, as the complete functionality can be enabled with only around 1 Watt.
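As a rough consistency check using only the figures reported above, dividing the throughput by the power efficiency gives the implied power draw:

$$
P \approx \frac{0.42\ \text{GOPS}}{0.35\ \text{GOPS/W}} \approx 1.2\ \text{W},
$$

which is of the same order as the stated consumption of around 1 Watt.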