Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While k iterations of the iterative solver can be unrolled to provide O(k) reduction in communication cost, the extent of this unrollin...

全面介紹

Saved in:

書目詳細資料
Main Authors:	Rafique, Abid, Constantinides, George A., Kapre, Nachiket
其他作者:	School of Computer Engineering
格式:	Article
語言:	English
出版:	2015
主題:	Graphics processing units (GPUs) Iterative numerical methods Spare matrix-vector multiply Matrix powers kernel Field programmable gate arrays (FPGAs)
在線閱讀:	https://hdl.handle.net/10356/81168 http://hdl.handle.net/10220/39128
標簽:	添加標簽沒有標簽, 成為第一個標記此記錄!
機構:	Nanyang Technological University
語言:	English

id	sg-ntu-dr.10356-81168
record_format	dspace
spelling	sg-ntu-dr.10356-811682023-12-29T06:51:28Z Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs Rafique, Abid Constantinides, George A. Kapre, Nachiket School of Computer Engineering Graphics processing units (GPUs) Iterative numerical methods Spare matrix-vector multiply Matrix powers kernel Field programmable gate arrays (FPGAs) Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While k iterations of the iterative solver can be unrolled to provide O(k) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure to select this algorithmic parameter k, which provides communication-computation tradeoff on hardware accelerators like FPGA and GPU. We provide predictive models to understand this tradeoff and show how careful selection of k can lead to performance improvement that otherwise demands significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9×-42.6× speedup over standard iterative solvers for a range of benchmarks and that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation and allows large k and hence higher speedups. Our approach for FPGA provides a 0.3×-4.4× speedup over same-generation GPU devices where k is picked carefully for both architectures for a range of benchmarks. Accepted version 2015-12-17T04:59:01Z 2019-12-06T14:22:52Z 2015-12-17T04:59:01Z 2019-12-06T14:22:52Z 2015 Journal Article Rafique, A., Constantinides, G. A., & Kapre, N. (2015). Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs. IEEE Transactions on Parallel and Distributed Systems, 26(1), 24-34. 1045-9219 https://hdl.handle.net/10356/81168 http://hdl.handle.net/10220/39128 10.1109/TPDS.2014.6 en IEEE Transactions on Parallel and Distributed Systems © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [http://dx.doi.org/10.1109/TPDS.2014.6]. 11 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Graphics processing units (GPUs) Iterative numerical methods Spare matrix-vector multiply Matrix powers kernel Field programmable gate arrays (FPGAs)
spellingShingle	Graphics processing units (GPUs) Iterative numerical methods Spare matrix-vector multiply Matrix powers kernel Field programmable gate arrays (FPGAs) Rafique, Abid Constantinides, George A. Kapre, Nachiket Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs
description	Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While k iterations of the iterative solver can be unrolled to provide O(k) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure to select this algorithmic parameter k, which provides communication-computation tradeoff on hardware accelerators like FPGA and GPU. We provide predictive models to understand this tradeoff and show how careful selection of k can lead to performance improvement that otherwise demands significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9×-42.6× speedup over standard iterative solvers for a range of benchmarks and that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation and allows large k and hence higher speedups. Our approach for FPGA provides a 0.3×-4.4× speedup over same-generation GPU devices where k is picked carefully for both architectures for a range of benchmarks.
author2	School of Computer Engineering
author_facet	School of Computer Engineering Rafique, Abid Constantinides, George A. Kapre, Nachiket
format	Article
author	Rafique, Abid Constantinides, George A. Kapre, Nachiket
author_sort	Rafique, Abid
title	Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs
title_short	Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs
title_full	Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs
title_fullStr	Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs
title_full_unstemmed	Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs
title_sort	communication optimization of iterative sparse matrix-vector multiply on gpus and fpgas
publishDate	2015
url	https://hdl.handle.net/10356/81168 http://hdl.handle.net/10220/39128
_version_	1787136726808920064

Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

相似書籍