Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While k iterations of the iterative solver can be unrolled to provide an O(k) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure to select the algorithmic parameter k, which governs the communication-computation tradeoff on hardware accelerators such as FPGAs and GPUs. We provide predictive models to understand this tradeoff and show how careful selection of k can lead to performance improvements that would otherwise demand a significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9×-42.6× speedup over standard iterative solvers for a range of benchmarks, and show that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation, allows a larger k, and hence yields higher speedups. Our approach for FPGAs provides a 0.3×-4.4× speedup over same-generation GPU devices, with k picked carefully for both architectures, across a range of benchmarks.
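To make the unrolling idea concrete, the following minimal Python sketch (not the authors' GPU or FPGA implementation) illustrates a matrix powers kernel: for one partition of rows it gathers the k-hop halo of the vector once and then performs k local sparse matrix-vector multiplies, so remote communication happens once per k iterations while redundant computation grows with the halo size. The partitioning, helper names, and the use of SciPy are illustrative assumptions.

import numpy as np
import scipy.sparse as sp

def k_level_halo(A, local_rows, k):
    # Rows reachable within k hops of local_rows in the sparsity graph of A (CSR).
    # Everything beyond local_rows is the redundant work that grows with k.
    reach = set(local_rows)
    frontier = set(local_rows)
    for _ in range(k):
        nxt = set()
        for r in frontier:
            start, end = A.indptr[r], A.indptr[r + 1]
            nxt.update(A.indices[start:end])
        frontier = nxt - reach
        reach |= frontier
    return sorted(reach)

def matrix_powers_partition(A, x, local_rows, k):
    # Compute the local_rows entries of A x, A^2 x, ..., A^k x using a single
    # gather of the k-hop halo (one communication step) plus k local SpMVs.
    halo = k_level_halo(A, local_rows, k)      # vector entries fetched once
    pos = {r: i for i, r in enumerate(halo)}
    A_sub = A[halo, :][:, halo]                # local rows plus redundant (ghost) rows
    x_sub = x[halo]
    results = []
    for _ in range(k):
        x_sub = A_sub @ x_sub                  # redundant compute grows with k
        results.append(x_sub)
    local_idx = [pos[r] for r in local_rows]
    return [y[local_idx] for y in results]

# Example (hypothetical): 1D Laplacian, rows 0..15 as the local partition, k = 3.
n, k = 64, 3
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
x = np.ones(n)
ys = matrix_powers_partition(A, x, list(range(16)), k)
# Sanity check against a plain dense computation of A^k x on the local rows.
assert np.allclose(ys[-1], (np.linalg.matrix_power(A.toarray(), k) @ x)[:16])

The key tradeoff is visible in k_level_halo: as k grows, the halo (and hence the redundant SpMV work) grows, which is what ultimately limits the useful unrolling depth on each architecture.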

Bibliographic Details
Main Authors: Rafique, Abid, Constantinides, George A., Kapre, Nachiket
Other Authors: School of Computer Engineering
Format: Article
Language: English
Published: 2015
Subjects: Graphics processing units (GPUs); Iterative numerical methods; Sparse matrix-vector multiply; Matrix powers kernel; Field programmable gate arrays (FPGAs)
Online Access:https://hdl.handle.net/10356/81168
http://hdl.handle.net/10220/39128
Institution: Nanyang Technological University
Published in: IEEE Transactions on Parallel and Distributed Systems, 26(1), 24-34
Citation: Rafique, A., Constantinides, G. A., & Kapre, N. (2015). Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs. IEEE Transactions on Parallel and Distributed Systems, 26(1), 24-34.
ISSN: 1045-9219
DOI: 10.1109/TPDS.2014.6
Version: Accepted version, 11 p., application/pdf
Rights: © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at http://dx.doi.org/10.1109/TPDS.2014.6.