Enhancing performance of Tall-Skinny QR factorization using FPGAs
Communication-avoiding linear algebra algorithms with low communication latency and high memory bandwidth requirements, like Tall-Skinny QR factorization (TSQR), are highly appropriate for acceleration using FPGAs. TSQR parallelizes QR factorization of tall-skinny matrices in a divide-and-conquer fashion by decomposing them into sub-matrices, performing local QR factorizations and then merging the intermediate results. As TSQR is a dense linear algebra problem, one would expect a GPU to perform well. However, GPU performance is limited by memory bandwidth in the local QR factorizations and by global communication latency in the merge stage. We exploit the shape of the matrix and propose an FPGA-based custom architecture which avoids these bottlenecks by using high-bandwidth on-chip memories for the local QR factorizations and by performing the merge stage entirely on-chip to reduce communication latency. We achieve a peak double-precision floating-point performance of 129 GFLOPs on a Virtex-6 SX475T. A quantitative comparison of our design with recent QR factorization implementations on FPGAs and GPUs shows speedups of up to 7.7× and 12.7×, respectively. Additionally, we show even higher performance over optimized linear algebra libraries such as Intel MKL for multi-cores, CULA for GPUs and MAGMA for hybrid systems.
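The divide-and-conquer scheme the abstract describes can be illustrated with a minimal NumPy sketch: split the tall matrix into row blocks, take a local QR of each, then merge by stacking the small R factors and factoring them once more. This is a flat, single-level merge for illustration only (the paper's FPGA architecture and any multi-level reduction tree are not reproduced here); the function name `tsqr` and the `block_rows` parameter are this sketch's own.

```python
import numpy as np

def tsqr(A, block_rows=8):
    """Tall-Skinny QR of A (m x n, m >> n): local QRs, then one merge step."""
    m, n = A.shape
    # Decompose the tall matrix into row-block sub-matrices.
    blocks = np.array_split(A, max(m // block_rows, 1), axis=0)
    # Local stage: independent QR factorization of each sub-matrix.
    local = [np.linalg.qr(B) for B in blocks]          # list of (Q_i, R_i)
    # Merge stage: stack the small R_i factors and factor the stack again.
    R_stack = np.vstack([R_i for _, R_i in local])
    Q_merge, R = np.linalg.qr(R_stack)                 # R is the final n x n factor
    # Recover the full Q by combining each local Q_i with its slice of Q_merge.
    out, row = [], 0
    for Q_i, R_i in local:
        k = R_i.shape[0]
        out.append(Q_i @ Q_merge[row:row + k, :])
        row += k
    return np.vstack(out), R

# Usage: factor a 4096 x 8 tall-skinny matrix and check the reconstruction.
A = np.random.default_rng(0).standard_normal((4096, 8))
Q, R = tsqr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(8)))
```

Because each block's QR touches only a small working set, the local stage maps naturally onto high-bandwidth on-chip memories, and the merge operates only on stacked n x n triangles, which is what makes the on-chip merge described above feasible.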
Main Authors: Rafique, Abid; Kapre, Nachiket; Constantinides, George A.
Other Authors: School of Computer Engineering
Format: Conference or Workshop Item
Language: English
Published: 2015
Subjects: Computer Science and Engineering
Online Access: https://hdl.handle.net/10356/81242
http://hdl.handle.net/10220/39153
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-81242
Conference: 2012 22nd International Conference on Field Programmable Logic and Applications (FPL)
Citation: Rafique, A., Kapre, N., & Constantinides, G. A. (2012). Enhancing performance of Tall-Skinny QR factorization using FPGAs. 22nd International Conference on Field Programmable Logic and Applications (FPL), 433-450.
DOI: 10.1109/FPL.2012.6339142
Rights: © 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at http://dx.doi.org/10.1109/FPL.2012.6339142.
Extent: 8 p. (application/pdf, accepted version)
Building: NTU Library
Country: Singapore
Collection: DR-NTU