Dataflow optimized overlays for FPGAs


Bibliographic Details
Main Author: Siddhartha
Other Authors: Arvind Easwaran
Format: Theses and Dissertations
Language: English
Published: 2019
Subjects:
Online Access:https://hdl.handle.net/10356/104891
http://hdl.handle.net/10220/47803
Institution: Nanyang Technological University
Description
Summary: Dataflow Coprocessor Overlay (DaCO) is an FPGA-tuned, dataflow-driven overlay architecture that offers fine-grained parallelism, delivering speedups of up to 2.8x on sparse, irregular computations over competing architectures (e.g. modern microprocessors and existing dataflow overlays). DaCO delivers these improvements with a custom instruction datapath that exploits the raw parallelism exposed by the dataflow triggering rule: instructions execute asynchronously as soon as their operands are available. However, this simple triggering logic can expose large amounts of irregular instruction-level parallelism that is hard to manage. This thesis addresses the challenge in three steps: (1) design of a lightweight scheduling circuit inside each DaCO soft-processor that enables large-scale out-of-order instruction execution at runtime, (2) design of a priority-aware communication framework that delivers improved quality of service to critical communication packets, and (3) compiler support that optimizes the dataflow graph structure for improved runtime execution. DaCO targets the Arria 10 AX115S (20 nm SoC) FPGA board in order to take advantage of its hard on-chip floating-point DSP blocks. Overall, when benchmarked with sparse matrix-vector multiply kernels, DaCO improves throughput by up to 2.4x over existing in-order dataflow overlays and delivers a peak throughput of up to 38 MFLOP/s per processor, or 3.5 GFLOP/s in total. The DaCO engine is composed of a custom dataflow-inspired soft-processor and a priority-aware Network-on-Chip (NoC) communication framework. Each soft-processor has a custom datapath that operates directly on the dataflow graph stored in local memory. We design a novel criticality-aware scheduling circuit inside the soft-processor that allows large-scale out-of-order node execution with minimal resource overhead.
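The dataflow triggering rule described above can be sketched as a small worklist simulation. This is an illustrative model only, not DaCO's actual ISA or datapath; the graph, node names, and operator set are assumptions for the example:

```python
# Minimal sketch of the dataflow triggering rule: a node "fires" as soon
# as all of its operands have arrived, regardless of program order.
# Graph structure, node names, and operations are illustrative.
import operator

OPS = {"+": operator.add, "*": operator.mul}

def run_dataflow(nodes, inputs):
    """nodes: {name: (op, [operand names])}; inputs: {name: value}.
    Returns (computed values, firing order)."""
    values = dict(inputs)
    consumers = {}              # operand -> nodes waiting on it
    pending = {}                # node -> count of missing operands
    for name, (_, args) in nodes.items():
        missing = [a for a in args if a not in values]
        pending[name] = len(missing)
        for a in missing:
            consumers.setdefault(a, []).append(name)
    ready = [n for n, c in pending.items() if c == 0]
    order = []
    while ready:
        n = ready.pop()         # any ready node may fire (out of order)
        op, args = nodes[n]
        values[n] = OPS[op](*(values[a] for a in args))
        order.append(n)
        for c in consumers.get(n, []):
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    return values, order

# y = (a + b) * (c + d): the two adds are independent and may fire in
# either order; the multiply fires only once both results arrive.
graph = {"t1": ("+", ["a", "b"]),
         "t2": ("+", ["c", "d"]),
         "y":  ("*", ["t1", "t2"])}
vals, order = run_dataflow(graph, {"a": 1, "b": 2, "c": 3, "d": 4})
print(vals["y"], order)   # 21, with "y" always firing last
```

The hard part, as the abstract notes, is that with hundreds of nodes the `ready` set grows large and irregular; choosing *which* ready node to fire is what the criticality-aware scheduler addresses.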
The scheduler achieves this with a one-time memory re-organization strategy together with a lightweight leading-ones detector circuit. The datapath is fully pipelined and employs data forwarding to achieve high performance, while the block RAMs (BRAMs) are multipumped to ensure efficient resource utilization. On the target Arria 10 chip, we can fit up to 600 soft-processors; each DaCO soft-processor consumes 779 ALMs, four BRAMs, and three DSP blocks, and operates at a 3.7 ns clock. The NoC communication framework is built with Hoplite-Q*, a novel FPGA-friendly router that augments the existing Hoplite router with priority-aware routing features. Together, the DaCO soft-processor and Hoplite-Q* manage and prioritize critical compute paths that were left unaddressed in prior work. On its own, Hoplite-Q* can accelerate high-priority communication packets by up to 90% compared to the baseline Hoplite router. Each Hoplite-Q* router consumes 215 ALMs (64 b packet with a 32 b payload) and operates at a 3.3 ns clock. DaCO also supports a clustered topology, where soft-processors in the overlay are grouped and connected by a local crossbar, while out-of-cluster communication is serviced by a Hoplite-Q* network. This strategy improves performance by up to 1.8–2x with only a 15–40% resource overhead from the crossbar (cluster sizes of two to four). Finally, this thesis also explores the importance of criticality in dataflow workloads and of compilation support. We explore the limits of recursive unrolling and tree balancing on dataflow graphs and quantify the tradeoffs between excess computation and reductions in critical-path length. We then develop a Huffman-inspired reassociation scheme that optimizes the dataflow graph based on statically computed node/edge criticality.
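The intuition behind Huffman-inspired reassociation can be sketched for an associative reduction: greedily combine the two operands that become available earliest, so that late-arriving (critical) values join the tree as late as possible. The latency model and availability times below are illustrative assumptions, not the thesis's actual cost model:

```python
# Hedged sketch of Huffman-style reassociation for an associative sum:
# always pair the two earliest-available operands, mirroring how Huffman
# coding pairs the two least-frequent symbols. ADD_LATENCY is a
# hypothetical per-add cost in cycles.
import heapq

ADD_LATENCY = 3

def chain_finish(times):
    """Finish time of a left-to-right reduction chain: ((a+b)+c)+d ..."""
    t = times[0]
    for u in times[1:]:
        t = max(t, u) + ADD_LATENCY
    return t

def huffman_finish(times):
    """Finish time when the two earliest-available operands are always
    combined first (a balanced tree when all arrive together)."""
    heap = list(times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + ADD_LATENCY)
    return heap[0]

# Eight operands all ready at time 0: the chain performs 7 sequential
# adds (21 cycles) while the tree needs only log2(8) = 3 levels (9).
print(chain_finish([0] * 8), huffman_finish([0] * 8))   # 21 9

# One late operand: the chain stalls on it partway through, while
# reassociation defers it to the final add.
print(chain_finish([10, 0, 0, 0]), huffman_finish([10, 0, 0, 0]))  # 19 13
```

This also illustrates the tradeoff quantified in the thesis: reassociation shortens the critical path but can add excess intermediate nodes relative to a simple chain.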
Together with fanin and fanout decomposition, we quantify the effect of these software transformations on the dataflow graph and demonstrate the performance tradeoffs when they run on hardware. The transformations are packaged as compiler optimizations that provide an easy-to-use programming model for the DaCO engine. In the future, we aim to develop the DaCO ecosystem further to support various flavors of dataflow-driven soft-processors. In particular, an asynchronous dynamic dataflow graph processor would map well to iterative problems from domains such as graph convolutional networks, molecular dynamics, and PageRank. In addition, we hope to improve the DaCO programming model by extending the existing ISA and supporting a codelet-based model, in which the compute abstraction assumes a coarser-grained instruction graph.