Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization

QR factorization is a ubiquitous operation in many engineering and scientific applications. In this paper, we present efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in-terms of Gflo...

Full description

Saved in:

Bibliographic Details
Main Authors:	Merchant, Farhad, Vatwani, Tarun, Chattopadhyay, Anupam, Raha, Soumyendu, Nandy, Soumitra Kumar, Narayan, Ranjani
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2020
Subjects:	Engineering::Computer science and engineering Parallel Computing Dense Linear Algebra
Online Access:	https://hdl.handle.net/10356/142088
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

id	sg-ntu-dr.10356-142088
record_format	dspace
spelling	sg-ntu-dr.10356-1420882020-06-15T09:19:12Z Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization Merchant, Farhad Vatwani, Tarun Chattopadhyay, Anupam Raha, Soumyendu Nandy, Soumitra Kumar Narayan, Ranjani School of Computer Science and Engineering Engineering::Computer science and engineering Parallel Computing Dense Linear Algebra QR factorization is a ubiquitous operation in many engineering and scientific applications. In this paper, we present efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in-terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit higher degree of parallelism where parallelism is quantified as a number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on theoretical analysis of classical HT, an opportunity to re-arrange computations in the classical HT is identified that results in Modified HT (MHT) where it is shown that MHT exhibits 1.33x times higher parallelism than classical HT. Experiments in off-the-shelf multicore and General Purpose Graphics Processing Units (GPGPUs) for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in the optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for Dense Linear Algebra (DLA) and show that MHT achieves 1.3x better performance than native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that due to re-arrangement in the computations in MHT, custom realization of MHT is capable of achieving 12 percent better performance improvement over multicore and GPGPUs than the performance improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs which is counter-intuitive. 2020-06-15T09:19:12Z 2020-06-15T09:19:12Z 2018 Journal Article Merchant, F., Vatwani, T., Chattopadhyay, A., Raha, S., Nandy, S. K., & Narayan, R. (2018). Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization. IEEE Transactions on Parallel and Distributed Systems, 29(8), 1707-1720. doi:10.1109/tpds.2018.2803820 1045-9219 https://hdl.handle.net/10356/142088 10.1109/TPDS.2018.2803820 2-s2.0-85041544263 8 29 1707 1720 en IEEE Transactions on Parallel and Distributed Systems © 2018 IEEE. All rights reserved.
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering Parallel Computing Dense Linear Algebra
spellingShingle	Engineering::Computer science and engineering Parallel Computing Dense Linear Algebra Merchant, Farhad Vatwani, Tarun Chattopadhyay, Anupam Raha, Soumyendu Nandy, Soumitra Kumar Narayan, Ranjani Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization
description	QR factorization is a ubiquitous operation in many engineering and scientific applications. In this paper, we present efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in-terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit higher degree of parallelism where parallelism is quantified as a number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on theoretical analysis of classical HT, an opportunity to re-arrange computations in the classical HT is identified that results in Modified HT (MHT) where it is shown that MHT exhibits 1.33x times higher parallelism than classical HT. Experiments in off-the-shelf multicore and General Purpose Graphics Processing Units (GPGPUs) for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in the optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for Dense Linear Algebra (DLA) and show that MHT achieves 1.3x better performance than native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that due to re-arrangement in the computations in MHT, custom realization of MHT is capable of achieving 12 percent better performance improvement over multicore and GPGPUs than the performance improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs which is counter-intuitive.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Merchant, Farhad Vatwani, Tarun Chattopadhyay, Anupam Raha, Soumyendu Nandy, Soumitra Kumar Narayan, Ranjani
format	Article
author	Merchant, Farhad Vatwani, Tarun Chattopadhyay, Anupam Raha, Soumyendu Nandy, Soumitra Kumar Narayan, Ranjani
author_sort	Merchant, Farhad
title	Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization
title_short	Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization
title_full	Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization
title_fullStr	Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization
title_full_unstemmed	Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization
title_sort	efficient realization of householder transform through algorithm-architecture co-design for acceleration of qr factorization
publishDate	2020
url	https://hdl.handle.net/10356/142088
_version_	1681059451027587072

Efficient realization of householder transform through algorithm-architecture co-design for acceleration of QR factorization

Similar Items