Accelerating BLAS and LAPACK via efficient floating point architecture design
Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK) form the basic building blocks of many High Performance Computing (HPC) applications and hence dictate their performance. Performance in such tuned packages is attained by tuning several algorithmic...
Saved in:
Main Authors: Merchant, Farhad; Chattopadhyay, Anupam; Raha, Soumyendu; Nandy, S. K.; Narayan, Ranjani
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2020
Subjects: Engineering::Computer science and engineering; Parallel Computing; Instruction Level Parallelism
Online Access: https://hdl.handle.net/10356/141525
Institution: Nanyang Technological University
Citation: Merchant, F., Chattopadhyay, A., Raha, S., Nandy, S. K., & Narayan, R. (2017). Accelerating BLAS and LAPACK via efficient floating point architecture design. Parallel Processing Letters, 27(3-4), 1750006. doi:10.1142/S0129626417500062
Journal: Parallel Processing Letters
ISSN: 0129-6264
DOI: 10.1142/S0129626417500062
Scopus ID: 2-s2.0-85038103631
Rights: © 2017 World Scientific Publishing Company. All rights reserved.
Building: NTU Library
Country: Singapore
Collection: DR-NTU
Description: Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK) form the basic building blocks of many High Performance Computing (HPC) applications and hence dictate their performance. Performance in such tuned packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the bandwidth of the memory, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on the performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of different floating point units, such as the multiplier, adder, square root, and divider, followed by a characterization of BLAS and LAPACK to determine the parameters required in the theoretical framework for deciding the optimum pipeline depth of the floating point operations. A simple design of a Processing Element (PE) is presented and shown to outperform the most recent custom realizations of BLAS and LAPACK by 1.1x to 1.5x in GFlops/W and 1.9x to 2.1x in GFlops/mm². Compared with multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700 platforms, a performance improvement of 1.8x to 80x is reported for the PE.
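The description turns on a trade-off: deepening an FPU pipeline shortens the clock period but lengthens hazard penalties, so an optimum pipeline depth exists. The sketch below illustrates that trade-off with a generic textbook-style analytic model, not the paper's framework; the function name time_per_op and every numeric parameter (logic delay, latch overhead, hazard rate) are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of the pipeline-depth trade-off behind FPU tuning.
# Assumptions (not from the paper): the total combinational logic delay
# is split evenly across `depth` stages, each stage adds a fixed latch
# overhead, and a hazard stalls the pipeline for `depth - 1` cycles.

def time_per_op(depth, logic_delay_ns=10.0, latch_overhead_ns=0.2,
                hazard_rate=0.3):
    """Average time per floating point operation at a given pipeline depth."""
    # Clock period: per-stage logic delay plus latch overhead.
    clock_period_ns = logic_delay_ns / depth + latch_overhead_ns
    # Expected cycles per operation, counting hazard-induced stalls.
    cycles_per_op = 1.0 + hazard_rate * (depth - 1)
    return clock_period_ns * cycles_per_op

# Sweep candidate depths to locate the optimum under these assumptions.
best = min(range(1, 33), key=time_per_op)
print(f"optimum pipeline depth under the assumed parameters: {best}")
for d in (1, 4, 8, best, 16, 32):
    ns = time_per_op(d)
    print(f"depth {d:2d}: {ns:6.2f} ns/op ({1.0 / ns:.3f} GFLOP/s per unit)")
```

Under these assumed numbers the sweep settles near a depth of 11: too shallow and the clock is slow, too deep and stalls dominate. The paper's characterization of BLAS and LAPACK can be read as supplying workload-specific values for the role the fixed hazard_rate plays in this toy model.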