A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs

Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of perform...

Bibliographic Details
Main Authors:	Moorthy, Pradeep; Kapre, Nachiket
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2018
Subjects:	Energy Efficiency; Sparse Graphs; DRNTU::Engineering::Computer science and engineering
Online Access:	https://hdl.handle.net/10356/87968 http://hdl.handle.net/10220/45595
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-87968
record_format	dspace
spelling	sg-ntu-dr.10356-87968 2020-03-07T11:48:52Z A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs Moorthy, Pradeep Kapre, Nachiket School of Computer Science and Engineering Energy Efficiency Sparse Graphs DRNTU::Engineering::Computer science and engineering Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory accesses with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges. Published version 2018-08-17T06:34:28Z 2019-12-06T16:53:10Z 2018-08-17T06:34:28Z 2019-12-06T16:53:10Z 2015 Journal Article Moorthy, P., & Kapre, N. (2015). A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs. Supercomputing Frontiers and Innovations, 2(3), 76-86. 2409-6008 https://hdl.handle.net/10356/87968 http://hdl.handle.net/10220/45595 10.14529/jsfi150307 en Supercomputing Frontiers and Innovations © 2015 The Author(s). This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited. 11 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
topic	Energy Efficiency; Sparse Graphs; DRNTU::Engineering::Computer science and engineering
spellingShingle	Energy Efficiency Sparse Graphs DRNTU::Engineering::Computer science and engineering Moorthy, Pradeep Kapre, Nachiket A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
description	Sparse graph problems are notoriously hard to accelerate on conventional platforms due to irregular memory access patterns resulting in underutilization of memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely-coupled FPGA accelerators can support distributed memory accesses with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Moorthy, Pradeep Kapre, Nachiket
format	Article
author	Moorthy, Pradeep; Kapre, Nachiket
author_sort	Moorthy, Pradeep
title	A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
title_short	A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
title_full	A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
title_fullStr	A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
title_full_unstemmed	A case for energy-efficient acceleration of graph problems using embedded FPGA-based SoCs
title_sort	case for energy-efficient acceleration of graph problems using embedded fpga-based socs
publishDate	2018
url	https://hdl.handle.net/10356/87968 http://hdl.handle.net/10220/45595
_version_	1681047717889966080

Similar Items