A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of perform...
Main Authors: | Moorthy, Pradeep; Kapre, Nachiket |
---|---|
Other Authors: | School of Computer Science and Engineering |
Format: | Article |
Language: | English |
Published: | 2018 |
Subjects: | Energy Efficiency; Sparse Graphs; DRNTU::Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/87968 http://hdl.handle.net/10220/45595 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-87968 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-879682020-03-07T11:48:52Z A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs Moorthy, Pradeep Kapre, Nachiket School of Computer Science and Engineering Energy Efficiency Sparse Graphs DRNTU::Engineering::Computer science and engineering Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory accesses with better matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish promise for energy-efficiency acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges. Published version 2018-08-17T06:34:28Z 2019-12-06T16:53:10Z 2018-08-17T06:34:28Z 2019-12-06T16:53:10Z 2015 Journal Article Moorthy, P., & Kapre, N. (2015). A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs. Supercomputing Frontiers and Innovations, 2(3), 76-86. 2409-6008 https://hdl.handle.net/10356/87968 http://hdl.handle.net/10220/45595 10.14529/jsfi150307 en Supercomputing Frontiers and Innovations © 2015 The Author(s). This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited. 11 p. application/pdf |
institution | Nanyang Technological University |
building | NTU Library |
country | Singapore |
collection | DR-NTU |
language | English |
topic | Energy Efficiency Sparse Graphs DRNTU::Engineering::Computer science and engineering |
spellingShingle | Energy Efficiency Sparse Graphs DRNTU::Engineering::Computer science and engineering Moorthy, Pradeep Kapre, Nachiket A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs |
description | Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory accesses with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges. |
author2 | School of Computer Science and Engineering |
author_facet | School of Computer Science and Engineering Moorthy, Pradeep Kapre, Nachiket |
format | Article |
author | Moorthy, Pradeep Kapre, Nachiket |
author_sort | Moorthy, Pradeep |
title | A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs |
title_short | A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs |
title_full | A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs |
title_fullStr | A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs |
title_full_unstemmed | A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs |
title_sort | case for energy-efficient acceleration of graph problems using embedded fpga-based socs |
publishDate | 2018 |
url | https://hdl.handle.net/10356/87968 http://hdl.handle.net/10220/45595 |
_version_ | 1681047717889966080 |