A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs

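Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory accesses with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges.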

Bibliographic Details
Main Authors: Moorthy, Pradeep, Kapre, Nachiket
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2018
Subjects: Energy Efficiency; Sparse Graphs; DRNTU::Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/87968
http://hdl.handle.net/10220/45595
Institution: Nanyang Technological University
id sg-ntu-dr.10356-87968
record_format dspace
spelling sg-ntu-dr.10356-87968 2020-03-07T11:48:52Z
Version: Published version
Dates: 2018-08-17T06:34:28Z; 2019-12-06T16:53:10Z
Type: Journal Article (2015)
Citation: Moorthy, P., & Kapre, N. (2015). A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs. Supercomputing Frontiers and Innovations, 2(3), 76-86.
ISSN: 2409-6008
DOI: 10.14529/jsfi150307
Journal: Supercomputing Frontiers and Innovations
Rights: © 2015 The Author(s). This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.
Extent: 11 p.
File format: application/pdf
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic Energy Efficiency
Sparse Graphs
DRNTU::Engineering::Computer science and engineering
description Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory accesses with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges.
author2 School of Computer Science and Engineering
format Article
author Moorthy, Pradeep
Kapre, Nachiket
author_sort Moorthy, Pradeep
title A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
publishDate 2018
url https://hdl.handle.net/10356/87968
http://hdl.handle.net/10220/45595
_version_ 1681047717889966080