Time-multiplexed FPGA overlays with linear interconnect

The benefits of FPGAs over processor-based systems have been well established; however, apart from specialist application domains such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the ma...

Bibliographic Details
Main Author: Li, Xiangwei
Other Authors: Douglas Leslie Maskell
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects:
Online Access: https://hdl.handle.net/10356/88026
http://hdl.handle.net/10220/46937
Institution: Nanyang Technological University
id sg-ntu-dr.10356-88026
record_format dspace
spelling sg-ntu-dr.10356-88026 2020-06-23T08:04:58Z Time-multiplexed FPGA overlays with linear interconnect Li, Xiangwei Douglas Leslie Maskell School of Computer Science and Engineering Centre for High Performance Embedded Systems DRNTU::Engineering::Computer science and engineering Doctor of Philosophy 2018-12-12T13:36:39Z 2019-12-06T16:54:24Z 2018 Thesis Li, X. (2018). Time-multiplexed FPGA overlays with linear interconnect. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/88026 http://hdl.handle.net/10220/46937 10.32657/10220/46937 en 136 p. application/pdf
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
description The benefits of FPGAs over processor-based systems have been well established; however, apart from specialist application domains such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the mainstream adoption of FPGAs and restricting their effective use to experts in hardware design. Coarse-grained overlay architectures have been proposed as a possible solution for improving design productivity by offering fast compilation and software-like programmability. These overlays can be either spatially configured (SC), with one complete functional unit (FU) allocated to each compute kernel operation and a routing network that is essentially static during computation, or time-multiplexed (TM), with the FUs and interconnect shared between kernel operations. This thesis examines an overlay architecture based on a simple linearly interconnected array of TM FUs. Sharing the FUs among kernel operations should significantly reduce the FPGA resource overhead compared to an SC overlay, which requires one FU for each operation along with a fully functional routing network to support connections to neighboring FUs. The linearly interconnected array of TM FUs should also reduce instruction storage and interconnect resource requirements compared to other TM overlays, again resulting in a more area-efficient overlay. To minimize the use of fine-grained FPGA resources, we use the DSP block to design a fast, fully pipelined, architecture-aware FU implementation that better targets the capabilities of the FPGA. The results show a reduction of up to 85% in FPGA resource requirements compared to existing throughput-oriented overlay architectures, with an operating frequency that approaches the theoretical limit for the FPGA device.
A number of architectural enhancements are then proposed to improve the performance of the DSP-block-based FU. The overlay subsystem is then integrated into complete hardware accelerator systems, along with memory interfaces to an ARM processor or a host CPU. To achieve this, we investigate two memory solutions based on AXI and PCIe interfaces, namely Xillybus and RIFFA. The performance of these hardware accelerators is evaluated on a range of benchmarks. The proposed AXI-Xillybus-V3 overlay system is also compared to a state-of-the-art TM overlay, VectorBlox MXP. The comparison shows that AXI-Xillybus-V3 achieves a very area-efficient implementation at the expense of around half of the throughput (limited by AXI-Xillybus using a 32-bit bus, compared to the 64-bit bus used by VectorBlox MXP). The proposed RIFFA-V3 overlay system shows 3.6× better performance than PCIe-Xillybus-V3 and 5.7× better performance than AXI-Xillybus-V3, but at the cost of higher BRAM consumption.
author2 Douglas Leslie Maskell
author_facet Douglas Leslie Maskell
Li, Xiangwei
format Theses and Dissertations
author Li, Xiangwei
author_sort Li, Xiangwei
title Time-multiplexed FPGA overlays with linear interconnect
publishDate 2018
url https://hdl.handle.net/10356/88026
http://hdl.handle.net/10220/46937