Time-multiplexed FPGA overlays with linear interconnect

The benefits of FPGAs over processor-based systems have been well established; however, apart from specialist application domains such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the ma...

Bibliographic Details
Main Author:	Li, Xiangwei
Other Authors:	Douglas Leslie Maskell
Format:	Theses and Dissertations
Language:	English
Published:	2018
Subjects:	DRNTU::Engineering::Computer science and engineering
Online Access:	https://hdl.handle.net/10356/88026 http://hdl.handle.net/10220/46937
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-88026
record_format	dspace
spelling	sg-ntu-dr.10356-880262020-06-23T08:04:58Z Time-multiplexed FPGA overlays with linear interconnect Li, Xiangwei Douglas Leslie Maskell School of Computer Science and Engineering Centre for High Performance Embedded Systems DRNTU::Engineering::Computer science and engineering The benefits of FPGAs over processor-based systems have been well established; however, apart from specialist application domains such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the mainstream adoption of FPGAs and restricting their effective use to experts in hardware design. Coarse-grained overlay architectures have been proposed as a possible solution for improving design productivity by offering fast compilation and software-like programmability. These overlays can be either spatially configured (SC), with one complete functional unit (FU) allocated to each compute kernel operation and a routing network that is essentially static during computation, or multiplexed, with the FUs and interconnect shared between kernel operations. This thesis examines an overlay architecture based on a simple linearly interconnected array of time-multiplexed (TM) functional units. Sharing the FUs among kernel operations should significantly reduce the FPGA resource overhead compared to an SC overlay, which requires one FU for each operation along with a fully functional routing network to support connections to neighboring FUs. The linearly interconnected array of TM FUs should also reduce instruction storage and interconnect resource requirements compared to other TM overlays, again resulting in a more area-efficient overlay. To minimize the use of fine-grained FPGA resources, we make use of the DSP block to design a fast, fully pipelined, architecture-aware FU implementation that better targets the capabilities of the FPGA. 
The results presented show a significant reduction of up to 85% in FPGA resource requirements compared to existing throughput-oriented overlay architectures, with an operating frequency that approaches the theoretical limit for the FPGA device. A number of architectural enhancements are then proposed to improve the performance of the DSP-block-based FU. The overlay subsystem is then integrated, along with memory interfaces to an ARM processor or a host CPU, into complete hardware accelerator systems. To achieve this, we investigate two different memory solutions based on AXI and PCIe interfaces, namely Xillybus and RIFFA. The performance of these hardware accelerators is evaluated over a range of benchmarks. The proposed AXI-Xillybus-V3 overlay system is also compared to a state-of-the-art TM overlay, namely VectorBlox MXP. The comparison shows that AXI-Xillybus-V3 achieves a very area-efficient implementation at the expense of around half of the throughput (limited by AXI-Xillybus using a 32-bit bus compared to the 64-bit bus used by VectorBlox MXP). The proposed RIFFA-V3 overlay system shows 3.6× better performance than PCIe-Xillybus-V3 and 5.7× better performance than AXI-Xillybus-V3, but at the cost of larger BRAM consumption. Doctor of Philosophy 2018-12-12T13:36:39Z 2019-12-06T16:54:24Z 2018-12-12T13:36:39Z 2019-12-06T16:54:24Z 2018 Thesis Li, X. (2018). Time-multiplexed FPGA overlays with linear interconnect. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/88026 http://hdl.handle.net/10220/46937 10.32657/10220/46937 en 136 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering
spellingShingle	DRNTU::Engineering::Computer science and engineering Li, Xiangwei Time-multiplexed FPGA overlays with linear interconnect
description	The benefits of FPGAs over processor-based systems have been well established; however, apart from specialist application domains such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the mainstream adoption of FPGAs and restricting their effective use to experts in hardware design. Coarse-grained overlay architectures have been proposed as a possible solution for improving design productivity by offering fast compilation and software-like programmability. These overlays can be either spatially configured (SC), with one complete functional unit (FU) allocated to each compute kernel operation and a routing network that is essentially static during computation, or multiplexed, with the FUs and interconnect shared between kernel operations. This thesis examines an overlay architecture based on a simple linearly interconnected array of time-multiplexed (TM) functional units. Sharing the FUs among kernel operations should significantly reduce the FPGA resource overhead compared to an SC overlay, which requires one FU for each operation along with a fully functional routing network to support connections to neighboring FUs. The linearly interconnected array of TM FUs should also reduce instruction storage and interconnect resource requirements compared to other TM overlays, again resulting in a more area-efficient overlay. To minimize the use of fine-grained FPGA resources, we make use of the DSP block to design a fast, fully pipelined, architecture-aware FU implementation that better targets the capabilities of the FPGA. The results presented show a significant reduction of up to 85% in FPGA resource requirements compared to existing throughput-oriented overlay architectures, with an operating frequency that approaches the theoretical limit for the FPGA device. 
A number of architectural enhancements are then proposed to improve the performance of the DSP-block-based FU. The overlay subsystem is then integrated, along with memory interfaces to an ARM processor or a host CPU, into complete hardware accelerator systems. To achieve this, we investigate two different memory solutions based on AXI and PCIe interfaces, namely Xillybus and RIFFA. The performance of these hardware accelerators is evaluated over a range of benchmarks. The proposed AXI-Xillybus-V3 overlay system is also compared to a state-of-the-art TM overlay, namely VectorBlox MXP. The comparison shows that AXI-Xillybus-V3 achieves a very area-efficient implementation at the expense of around half of the throughput (limited by AXI-Xillybus using a 32-bit bus compared to the 64-bit bus used by VectorBlox MXP). The proposed RIFFA-V3 overlay system shows 3.6× better performance than PCIe-Xillybus-V3 and 5.7× better performance than AXI-Xillybus-V3, but at the cost of larger BRAM consumption.
author2	Douglas Leslie Maskell
author_facet	Douglas Leslie Maskell Li, Xiangwei
format	Theses and Dissertations
author	Li, Xiangwei
author_sort	Li, Xiangwei
title	Time-multiplexed FPGA overlays with linear interconnect
title_short	Time-multiplexed FPGA overlays with linear interconnect
title_full	Time-multiplexed FPGA overlays with linear interconnect
title_fullStr	Time-multiplexed FPGA overlays with linear interconnect
title_full_unstemmed	Time-multiplexed FPGA overlays with linear interconnect
title_sort	time-multiplexed fpga overlays with linear interconnect
publishDate	2018
url	https://hdl.handle.net/10356/88026 http://hdl.handle.net/10220/46937
_version_	1681058394143719424

Similar Items