Automating FPGA-based CSoC platform generation
Main Author: | Siriwardena Wijesundera, Deshya Senelie |
---|---|
Other Authors: | Thambipillai Srikanthan |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2019 |
Subjects: | DRNTU::Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/90141 http://hdl.handle.net/10220/48432 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-90141 |
---|---|
record_format | dspace |
institution | Nanyang Technological University |
building | NTU Library |
country | Singapore |
collection | DR-NTU |
language | English |
topic | DRNTU::Engineering::Computer science and engineering |
spellingShingle | DRNTU::Engineering::Computer science and engineering Siriwardena Wijesundera, Deshya Senelie Automating FPGA-based CSoC platform generation |
description |
Field Programmable Gate Array (FPGA) based configurable system-on-chip (CSoC) platforms have become a preferred choice for embedded computing systems to meet the increasing demand for shorter Time-to-Market (TTM) and lower Non-Recurring Engineering (NRE) costs, owing to both their high density and their myriad on-chip hardware and software compute resources. However, the inability of existing tools to exploit these resources effectively to satisfy design constraints, especially from high-level specifications such as C/C++, remains a bottleneck in meeting TTM pressures. In this research, techniques for the automatic generation of an FPGA-based CSoC platform have been proposed to satisfy area-time design constraints while taking user preferences into account.
A rapid technique has been proposed to estimate application performance (runtime) on soft core processors. The proposed methodology relies on the target-independent intermediate representation (IR) of the LLVM compiler and does not require the application to be executed on the target processor or on an instruction set simulator, which makes it applicable to other soft core processors and their corresponding FPGA architectures. Further, the approach scales to the large number of configuration options available in modern soft core processors. The technique takes into account both data hazards and control hazards within the processor pipeline in order to obtain high estimation accuracy. Experimental results using applications from the CHStone benchmark suite on two commercial soft core processors, Xilinx MicroBlaze and Altera Nios, show an error of only 5% averaged across the full design space.
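The idea of estimating runtime from IR without running the application on the target can be illustrated with a minimal sketch. It is not the thesis's model: the opcode cycle table, the `BasicBlock` fields, and the stall and branch penalties below are all hypothetical values chosen for illustration, and block execution counts are assumed to be available from profiling the IR on a host machine.

```python
from dataclasses import dataclass

# Hypothetical per-opcode cycle costs for one processor configuration.
# Illustrative numbers only, not measured MicroBlaze or Nios values.
CYCLES = {"add": 1, "mul": 3, "load": 2, "store": 2, "br": 1}


@dataclass
class BasicBlock:
    opcodes: list             # IR-like opcode names, in program order
    exec_count: int           # times the block executes (assumed host-side profile)
    load_use_pairs: int = 0   # loads whose result is consumed immediately
    taken_branches: int = 0   # taken branches per block execution (0 or 1)


def estimate_cycles(blocks, load_use_stall=1, branch_penalty=2):
    """Sum per-block cycle costs, weighted by block execution counts."""
    total = 0
    for bb in blocks:
        base = sum(CYCLES[op] for op in bb.opcodes)
        data_hazard = bb.load_use_pairs * load_use_stall    # pipeline stalls
        control_hazard = bb.taken_branches * branch_penalty  # pipeline flushes
        total += bb.exec_count * (base + data_hazard + control_hazard)
    return total


blocks = [
    BasicBlock(["load", "add", "br"], exec_count=100,
               load_use_pairs=1, taken_branches=1),
    BasicBlock(["mul", "store"], exec_count=10),
]
print(estimate_cycles(blocks))  # 100 * (4 + 1 + 2) + 10 * 5 = 750
```

Changing a configuration option (for example, a slower multiplier) would correspond to swapping entries in the cost table, which is why such a model can cover many configurations without re-running the application.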
Noting that modern FPGA platforms typically also integrate hard core processors, the technique proposed for soft core processors has been extended to support performance estimation for hard core processors. This necessitated the introduction of models for performance-centric features such as dual-issue, out-of-order and superscalar execution. The proposed technique takes into account data hazards, control hazards and structural hazards within the processor pipeline in order to obtain high estimation accuracy. The technique has been tested using applications from the CHStone benchmark suite on the ARM Cortex-A9 processor in a Xilinx Zynq SoC FPGA and has been shown to be accurate, with an average estimation error of only 5.84%. This accuracy compares well with that of the soft core processor performance estimation. Moreover, a unified framework for performance estimation on both soft core and hard core processors has been proposed and described.
A novel technique for hardware area-time estimation of applications on FPGAs has been proposed. The application C code is first converted to the target-independent LLVM IR, and the basic blocks are then wrapped as functions using an LLVM transformation pass. The LegUp tool’s ‘LLVM IR functions to RTL modules’ conversion is then carried out to facilitate RTL synthesis using the Altera Quartus tools. In order to support FPGAs other than Altera's, the soft IP cores generated by LegUp are replaced with generic RTL components. This approach, together with methods for incorporating vendor-specific basic IP cores, has made it possible to support FPGAs from other vendors with high area-time estimation accuracy. The proposed technique relies on the free versions of the vendor tools and LegUp. Moreover, the proposed approach does not require time-consuming post-synthesis steps such as Place & Route and Bit Stream Generation in order to obtain reasonably accurate area estimates.
A technique for data dependency-aware hardware-software partitioning has been proposed. The complex data dependencies between the basic blocks amenable to hardware acceleration, as well as all the memory components, are identified to facilitate the hardware-software partitioning. The partitioning problem is then modelled as an Oregon Trail Knapsack, in which the value of an item of type x depends on the presence of another item of type y in the knapsack. In order to overcome the exponential complexity of traditional hardware-software partitioning problems, an integer linear programming based heuristic with a time complexity of O(n^2) has been employed to solve the Oregon Trail Knapsack model. The proposed heuristic is capable of recommending the most suitable hardware-software partitioning under multiple user-specified area constraints. The experimental results using applications from the CHStone benchmark suite show that the performance of the accelerators recommended by the proposed technique is within 99% of that obtained from an exhaustive approach. When compared with the existing state-of-the-art, the proposed approach reduced the runtime by orders of magnitude.
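A knapsack whose item values depend on which other items are present can be illustrated with a small greedy sketch. This is a plain greedy stand-in, not the integer linear programming based heuristic of the thesis; the function name, the `bonus` table, and the example numbers are hypothetical. Here `bonus[(i, j)]` models the extra cycles saved when two data-dependent blocks are both placed in hardware.

```python
def greedy_partition(area, gain, bonus, budget):
    """Pick basic blocks for hardware under an area budget.

    area[i]       : area cost of accelerating block i (must be > 0)
    gain[i]       : cycles saved by accelerating block i on its own
    bonus[(i, j)] : extra cycles saved when i and j are both in hardware
    Runs in O(n^2) for n candidate blocks.
    """
    marginal = list(gain)              # current value of adding each block
    remaining = set(range(len(area)))
    chosen, used = [], 0
    while True:
        fits = [i for i in remaining
                if used + area[i] <= budget and marginal[i] > 0]
        if not fits:
            break
        # Best value per unit area; ties go to the lower index.
        best = max(fits, key=lambda i: (marginal[i] / area[i], -i))
        chosen.append(best)
        used += area[best]
        remaining.remove(best)
        # Dependent values: selecting `best` raises the value of its partners.
        for i in remaining:
            marginal[i] += bonus.get((best, i), 0) + bonus.get((i, best), 0)
    return chosen, used


area = [4, 3, 5]
gain = [40, 15, 45]
bonus = {(0, 1): 30}   # blocks 0 and 1 share data
print(greedy_partition(area, gain, bonus, budget=8))  # ([0, 1], 7)
```

In this example block 1 looks unattractive on its own, but once block 0 is selected its value rises from 15 to 45, so the pair is chosen over block 2. A greedy pass like this gives no optimality guarantee; it only shows how dependent values change the selection.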
A novel application-specific instruction subsetting technique for soft core processor customization has been proposed to minimize the area utilized by the soft core processor. The design space is first pruned by exploiting dependencies between configurations, after which the microarchitecture is systematically subsetted by referring to the LLVM IR of the application. Two impact-ordered, tree-based selection heuristics with a time complexity of O(n) have been proposed to perform application-specific instruction subsetting. The proposed technique has been tested using applications from the CHStone benchmark suite and two handwritten applications on the Altera Nios soft core processor. The methodology provides an average area reduction of 47.58% compared to a processor with all configuration options enabled. Further, the selection heuristics provide a significant speed-up in the order of 106X over an existing approach.
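The core observation behind subsetting, that optional processor units can be dropped when the application's IR never uses the operations they serve, can be sketched as follows. This is not the impact-ordered tree heuristic of the thesis; the unit names and the opcode mapping are hypothetical and do not correspond to actual Nios or MicroBlaze configuration options.

```python
# Hypothetical mapping from optional processor units to the IR opcodes
# that need them. Real soft core configuration options differ.
OPTIONAL_UNITS = {
    "hw_multiplier": {"mul"},
    "hw_divider": {"sdiv", "udiv", "srem", "urem"},
    "barrel_shifter": {"shl", "lshr", "ashr"},
    "fpu": {"fadd", "fsub", "fmul", "fdiv"},
}


def required_units(ir_opcodes):
    """Return the optional units that at least one IR opcode relies on."""
    used = set(ir_opcodes)  # one linear pass over the opcode stream
    return sorted(unit for unit, ops in OPTIONAL_UNITS.items() if ops & used)


app_opcodes = ["add", "mul", "shl", "load", "mul"]
print(required_units(app_opcodes))  # ['barrel_shifter', 'hw_multiplier']
```

Every unit absent from the result is a candidate for removal. A practical flow would also have to weigh the performance impact of emulating an operation in software, which this sketch ignores.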
The methods described above have been integrated to facilitate the automatic generation of constraint-aware platforms for application-specific FPGA-based CSoCs. The automated framework, called Wibheda+, strives to obtain the best performance under a user-specified area constraint and a user-specified soft core or hard core processor (if any). Pre-defined code segments can be excluded from hardware acceleration to accommodate design flexibility. Wibheda+ has been evaluated extensively with the ARM Cortex-A9, Altera Nios and Xilinx MicroBlaze processors. Applications from the CHStone benchmark suite have been employed to evaluate Wibheda+ on Cyclone V, Cyclone II and Arria II FPGA devices from Altera as well as Kintex 7 and Artix 7 devices from Xilinx. Experimental results show that Wibheda+ can consistently identify near-optimal solutions in a few minutes. Wibheda+ can also be relied upon to recommend the most suitable CSoC platform from the supported FPGA devices and processors. Finally, the proposed automatic generation of constraint-aware platforms for application-specific FPGA-based CSoCs enables rapid design space exploration of complex applications without violating stringent TTM requirements. |
author2 | Thambipillai Srikanthan |
author_facet | Thambipillai Srikanthan Siriwardena Wijesundera, Deshya Senelie |
format | Theses and Dissertations |
author | Siriwardena Wijesundera, Deshya Senelie |
author_sort | Siriwardena Wijesundera, Deshya Senelie |
title | Automating FPGA-based CSoC platform generation |
title_short | Automating FPGA-based CSoC platform generation |
title_full | Automating FPGA-based CSoC platform generation |
title_fullStr | Automating FPGA-based CSoC platform generation |
title_full_unstemmed | Automating FPGA-based CSoC platform generation |
title_sort | automating fpga-based csoc platform generation |
publishDate | 2019 |
url | https://hdl.handle.net/10356/90141 http://hdl.handle.net/10220/48432 |
_version_ | 1681057989827493888 |
spelling |
sg-ntu-dr.10356-90141 2020-07-02T01:59:33Z Automating FPGA-based CSoC platform generation Siriwardena Wijesundera, Deshya Senelie Thambipillai Srikanthan School of Computer Science and Engineering Centre for High Performance Embedded Systems DRNTU::Engineering::Computer science and engineering Doctor of Philosophy 2019-05-29T02:21:01Z 2019-12-06T17:41:40Z 2019-05-29T02:21:01Z 2019-12-06T17:41:40Z 2019 Thesis Siriwardena Wijesundera, D. S. (2019). Automating FPGA-based CSoC platform generation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/90141 http://hdl.handle.net/10220/48432 10.32657/10220/48432 en 218 p. application/pdf |