Architecture centric coarse-grained FPGA overlays

Bibliographic Details
Main Author: Abhishek Kumar Jain
Other Authors: Douglas Leslie Maskell
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access: http://hdl.handle.net/10356/69532
Institution: Nanyang Technological University
Description
Summary: Coarse-grained FPGA overlays have emerged as one possible solution to make FPGAs more accessible to application developers who are accustomed to software API abstractions and fast development cycles. Existing overlay architectures offer a number of advantages for general-purpose hardware acceleration because of software-like programmability, fast compilation, application portability, and improved design productivity, but at the cost of area and performance overheads due to limited consideration for the underlying FPGA architecture. This thesis explores coarse-grained overlays designed using the flexible DSP48E1 primitive on Xilinx FPGAs, allowing pipelined execution of compute kernels at significantly higher throughput. We first evaluate an open-source overlay architecture, DySER, mapped on the Xilinx Zynq device and show that DySER suffers from a significant area and performance overhead due to limited consideration for the underlying FPGA architecture. Next, we design and implement a more FPGA-targeted overlay architecture that maximizes the peak performance and reduces the interconnect area overhead through the use of an array of DSP block-based, fully pipelined functional units (FUs) and an island-style coarse-grained routing network. As the interconnect area of the island-style overlay is still excessive, we explore novel interconnect architectures to reduce it further. We then develop DeCO, a cone-shaped cluster of FUs, which shows 87% savings in LUT requirements compared to our island-style overlay for a set of compute kernels. Our experimental evaluation shows that the proposed overlays exhibit frequencies close to the DSP theoretical limit and achieve high performance with significantly reduced area overheads. We also present a methodology for compiling high-level language (C/OpenCL) descriptions of compute kernels onto DSP block-based coarse-grained overlays. Our mapping flow provides a rapid, vendor-independent mapping to the overlay, raising the abstraction level while also reducing compilation time significantly, hence addressing the design productivity issue.