Zedwulf: Power-Performance Tradeoffs of a 32-Node Zynq SoC Cluster

Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA fabric such as the Xilinx Zynq SoC have become a competitive energy-efficient platform for addressing irregular parallelism in graph problems. In this paper, we prototype a 32-node cluster composed from these Zynq SoC...

Full description

Saved in:

Bibliographic Details
Main Authors:	Moorthy, Pradeep, Kapre, Nachiket
Other Authors:	School of Computer Engineering
Format:	Conference or Workshop Item
Language:	English
Published:	2015
Subjects:	Computer Science and Engineering
Online Access:	https://hdl.handle.net/10356/83649 http://hdl.handle.net/10220/39205
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

Description
Summary:	Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA fabric such as the Xilinx Zynq SoC have become a competitive energy-efficient platform for addressing irregular parallelism in graph problems. In this paper, we prototype a 32-node cluster composed from these Zynq SoC chips to accelerate communication-bound sparse graph-oriented applications such as neural network simulations. We develop specialized MPI routines specifically developed for irregular accelerator-to-accelerator communication of small message traffic. We use the ARM processor for handling the MPI stack while offloading compute-intensive calculations to the FPGA. For graphs with 32M nodes and 32M edges, Zedwulf delivers the highest 94 MTEPS (Million Traversed Edges Per Second)throughput over other x86 multi-threaded platforms in our study by 1.2 -- 1.4×. For this experiment, Zedwulf operates at an efficiency of 0.49 MTEPS/W when using ARM+FPGA which is1.2× better than using ARMv7 CPUs alone, and within 8% of the Intel Core i7-4770k platform.

Zedwulf: Power-Performance Tradeoffs of a 32-Node Zynq SoC Cluster

Similar Items