Mapping large-scale systems on to high density FPGAs
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2020 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/144029 |
Institution: | Nanyang Technological University |
Summary: | Modern FPGAs, which benefit from advances in process technology and hard IP cores, are increasingly the platform of choice for multi-million logic cell designs because of their lower Non-Recurring Engineering (NRE) costs and shorter Time-to-Market (TTM). While state-of-the-art CAD tools map small to medium-scale designs efficiently, they suffer from prohibitively long compilation times, higher power consumption and sub-optimal compute performance for large applications. This thesis proposes novel techniques to accelerate the mapping of large applications onto high-density heterogeneous FPGAs while lowering overall power consumption without compromising compute performance.
A communication-aware partitioning technique is proposed in Chapter 3 to decompose a large design into smaller subsystems by analysing signal interaction characteristics at the Register Transfer Level (RTL). Basic computation units exhibiting higher intra-unit communication costs are systematically merged to form a subsystem without violating the constraints imposed on the routing resources of the target FPGA. An area constraint is also imposed during merging to ensure that the basic computation units of a subsystem remain in close proximity to each other. The subsystem generation process has been automated: given the area constraint and a routing congestion metric, a large design is decomposed into subsystems, and each subsystem is then ported to the FPGA using the commercial Intel (Altera) Quartus Prime software. Evaluations on large applications derived from the widely used Polybench benchmark suite show that the proposed technique achieves up to 60% improvement in compute performance (F.max) and nearly 20% reduction in overall energy consumption. The method can also be readily applied to other FPGA families and CAD flows.
Next, a scalable technique for characterizing modern FPGA devices is proposed to facilitate the target-aware mapping of the subsystems of a large design onto high-density heterogeneous FPGAs. This one-time offline characterization process determines the FPGA resources available within templates of varying sizes at any given location on the target FPGA. A wide range of computation units exhibiting the characteristics of the Polybench benchmark suite were then mapped onto these templates to estimate F.max at any given location on the target FPGA. A machine learning model that leverages the large amount of data extracted from this characterization process is also proposed to automate F.max estimation with high accuracy. The technique has been extensively evaluated using 26 applications from the Polybench benchmark suite and synthetic benchmarks on 4 FPGA devices from Altera, showing that the average error of the automated F.max estimation is less than 4%.
To minimize the inter-subsystem communication costs of a large design, an Ant Colony Optimization (ACO) algorithm is proposed to identify subsystems that must be placed in close proximity to each other. Routing congestion estimation is incorporated to limit the size of each group of subsystems. Next, the most profitable footprints, defined by the position and shape of templates, are selected for each subsystem. A modified ACO algorithm is then employed to determine the best non-overlapping footprint for each subsystem. This ACO algorithm further aims to identify footprints in close proximity to maximize compute performance and power savings. Footprints exhibiting tight integration are locked in place to facilitate rapid convergence of ACO's iterative process. Experimental results show that the subsystem placement technique can outperform the Quartus CAD flow, reducing routing power by over 17% without degrading F.max.
The techniques presented in this thesis have been integrated into an automated subsystem mapping framework that can efficiently map large applications consuming 160K or more logic cells onto high-density heterogeneous FPGAs. To further ensure tight integration of subsystems, a subsystem splitting strategy is incorporated to eliminate inadvertent gaps formed during the ACO placement process. Experimental results show that the proposed framework can reduce routing power by over 18% while improving compute performance by 8%. Moreover, the solutions generated by the proposed approach are close to 23% more energy-efficient than those generated using existing commercial CAD tools. Unlike state-of-the-art commercial tools, which take a prohibitively long time to map large applications, the proposed approach is also notably faster and compiles deterministically.
The integrated framework has also been deployed for automated design space exploration, in which the parameters that control the subsystem generation process are varied to influence both the size and number of subsystems for a given large application. Each subsystem so generated is mapped onto the target FPGA with the help of the latest commercial CAD tools to facilitate the design space exploration. Unlike existing commercial solutions, the proposed techniques yield notable power savings, particularly for large applications. They also provide higher F.max and lower runtime, thereby overcoming the drawbacks of state-of-the-art commercial solutions. The proposed techniques can therefore be readily deployed to support emerging high-density FPGA architectures while complementing advances in emerging commercial tools for mapping very large applications. |
---|
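The merging step described in the summary (combining computation units that communicate heavily, subject to an area constraint) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the greedy pairwise strategy, the function name `merge_units`, the `areas` and `comm` inputs, and the `max_area` parameter are all assumptions made for illustration, and the routing-resource constraint mentioned in the summary is omitted.

```python
from itertools import combinations


def merge_units(areas, comm, max_area):
    """Greedily merge the pair of clusters with the highest communication
    cost between them, as long as the merged area stays within max_area.

    areas:    dict mapping unit name -> area (hypothetical units)
    comm:     dict mapping frozenset({u, v}) -> communication cost
    max_area: area budget per subsystem (illustrative stand-in for the
              area constraint; routing constraints are not modelled)
    """
    clusters = {name: {name} for name in areas}
    cluster_area = dict(areas)

    def cost(a, b):
        # Total communication between the members of two clusters.
        return sum(
            comm.get(frozenset((u, v)), 0)
            for u in clusters[a]
            for v in clusters[b]
        )

    while True:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            if cluster_area[a] + cluster_area[b] > max_area:
                continue  # merging would violate the area constraint
            c = cost(a, b)
            if c > 0 and (best is None or c > best[0]):
                best = (c, a, b)
        if best is None:
            break  # no feasible pair with non-zero communication remains
        _, a, b = best
        clusters[a] |= clusters.pop(b)
        cluster_area[a] += cluster_area.pop(b)

    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    areas = {"a": 2, "b": 2, "c": 2, "d": 2}
    comm = {
        frozenset(("a", "b")): 10,
        frozenset(("b", "c")): 3,
        frozenset(("c", "d")): 8,
    }
    print(merge_units(areas, comm, max_area=4))  # [['a', 'b'], ['c', 'd']]
```

In this toy input, the units `a` and `b` are merged first because they share the highest communication cost, and the area budget then prevents the resulting subsystem from absorbing `c`, which instead pairs with `d`.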