Time-multiplexed FPGA overlays with linear interconnect
The benefits of FPGAs over processor-based systems have been well established, however apart from specialist application domains, such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the ma...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: |
2018
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/88026 http://hdl.handle.net/10220/46937 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | The benefits of FPGAs over processor-based systems have been well established, however apart from specialist application domains, such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the mainstream adoption of FPGAs and restricting their effective use to experts in hardware design. Coarse-grained overlay architectures have been proposed as a possible solution for improving design productivity by offering fast compilation and software-like programmability. These overlays can either be spatially configured (SC), with one complete functional unit (FU) allocated to each compute kernel operation and a routing network which is essentially static during computation, or, multiplexed, with the FUs and interconnect being shared between kernel operations.
This thesis examines an overlay architecture based on a simple linear interconnected array of time-multiplexed (TM) functional units. Sharing the FUs among kernel operations should significantly reduce the FPGA resource overhead compared to an SC overlay which requires one FU for each operation along with a fully functional routing network to support connections to neighboring FUs. The linear interconnected array of TM FUs should also result in reduced instruction storage and interconnect resource requirements compared to other TM overlays, again resulting in a more area efficient overlay.
In order to minimize the use of the fine-grained FPGA resource, we make use of the DSP block to design a fast, fully-pipelined, architecture-aware FU implementation, better targeting the capabilities of the FPGA. The results presented show a significant reduction of up to 85% in FPGA resource requirements compared to existing throughput oriented overlay architectures, with an operating frequency which approaches the theoretical limit for the FPGA device. A number of architectural enhancements are then proposed to improve the performance of the DSP block based FU.
The overlay subsystem is then integrated into complete hardware accelerator systems, along with memory interfaces, to an ARM processor or a host CPU. To achieve this, we investigate two different memory solutions based on AXI and PCIe interfaces, namely Xillybus and RIFFA. The performance of these hardware accelerators for a range of benchmarks is investigated and performance results are presented. The proposed AXI-Xillybus-V3 overlay system is also compared to a state-of-art TM overlay, namely VectorBlox MXP. The comparison results show the AXI-Xillybus-V3 achieves a very area efficient implementation at the expense of around half of the throughput (limited by AXI-Xillybus using a 32-bit bus compared to the 64-bit bus used by VectorBlox MXP). The proposed RIFFA-V3 overlay system shows a 3.6× better performance compared to the PCIe-Xillybus-V3, and a 5.7× better performance than AXI-Xillybus-V3, but at the cost of a larger BRAM consumption. |
---|