Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs
Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements in sparse matrices and their corresponding vector elements. Due to irregular data access patterns...
Saved in:
Main Authors: Li, Shiqing; Liu, Di; Liu, Weichen
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports
Online Access: https://hdl.handle.net/10356/155570
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-155570
record_format: dspace
spelling:
Authors: Li, Shiqing; Liu, Di; Liu, Weichen. School of Computer Science and Engineering.
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). HP-NTU Digital Manufacturing Corporate Lab.
Subjects: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports.
Funding: This work is partially supported by the Ministry of Education (MOE), Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
Deposited: 2022-03-07. Published: 2021. Type: Conference Paper.
Citation: Li, S., Liu, D. & Liu, W. (2021). Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). https://dx.doi.org/10.1109/ICCAD51958.2021.9643453
ISBN: 9781665445078. Handle: https://hdl.handle.net/10356/155570. DOI: 10.1109/ICCAD51958.2021.9643453. Scopus: 2-s2.0-85124164636. Dataset: 10.21979/N9/ATEYFB.
Rights: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICCAD51958.2021.9643453.
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports
description:
Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements in sparse matrices and their corresponding vector elements. Due to irregular data access patterns of vector elements and the limited memory bandwidth, the computational throughput of CPUs and GPUs is lower than the peak performance offered by FPGAs. An FPGA's large on-chip memory allows the input vector to be buffered on-chip, and hence the off-chip memory bandwidth is only utilized to transfer the nonzero elements' values, column indices, and row indices. Multiple nonzero elements are transmitted to the FPGA, and then their corresponding vector elements are accessed per cycle. However, typical on-chip block RAMs (BRAMs) in FPGAs only have two access ports. The mismatch between off-chip memory bandwidth and on-chip memory ports stalls the whole engine, resulting in inefficient utilization of off-chip memory bandwidth. In this work, we reorder the nonzero elements to optimize data reuse for SpMV on FPGAs. The key observation is that since a vector element can be reused by all nonzero elements with the same column index, the memory requests of these elements can be omitted by reusing the fetched data. Based on this observation, a novel compressed format is proposed to optimize data reuse by reordering the matrix's nonzero elements. Further, to support the compressed format, we design a scalable hardware accelerator and implement it on the Xilinx UltraScale ZCU106 platform. We evaluate the proposed design with a set of matrices from the University of Florida sparse matrix collection. The experimental results show that the proposed design achieves an average 1.22x performance speedup over the state-of-the-art work.
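The abstract's key observation — nonzero elements sharing a column index can reuse the same fetched vector element once they are made adjacent — can be sketched with a toy model. This is an illustrative sketch only (plain Python, with a hypothetical one-entry reuse buffer standing in for a fetched BRAM word); it is not the paper's actual compressed format or accelerator:

```python
# Sketch: COO-style SpMV that counts vector-element fetches, assuming only
# the most recently fetched x[col] can be reused (a one-entry reuse buffer).
# Reordering nonzeros by column index makes equal columns adjacent, so the
# fetch count drops while the result vector y is unchanged.

def spmv_coo(rows, cols, vals, x, n_rows):
    """Compute y = A @ x from COO triples; also count x-fetches."""
    y = [0.0] * n_rows
    fetches = 0
    last_col = None
    for r, c, v in zip(rows, cols, vals):
        if c != last_col:      # new column index forces a memory request
            fetches += 1
            last_col = c
        y[r] += v * x[c]       # reuse the already-fetched x[c] otherwise
    return y, fetches

# 4x4 sparse matrix in row-major COO order: column indices repeat,
# but non-adjacently, so the one-entry buffer gives no reuse.
rows = [0, 0, 1, 2, 2, 3]
cols = [0, 2, 1, 0, 2, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0]

y1, f1 = spmv_coo(rows, cols, vals, x, 4)

# Reorder nonzeros so equal column indices become adjacent.
order = sorted(range(len(vals)), key=lambda i: cols[i])
y2, f2 = spmv_coo([rows[i] for i in order],
                  [cols[i] for i in order],
                  [vals[i] for i in order], x, 4)

assert y1 == y2        # reordering does not change the product
print(f1, f2)          # row-major order: 6 fetches; column-sorted: 3
```

The real design must also track row indices of the reordered nonzeros (hence the proposed compressed format) and accumulate partial sums out of order, but the fetch-count arithmetic above is the core of the bandwidth saving.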
author2: School of Computer Science and Engineering
format: Conference or Workshop Item
author: Li, Shiqing; Liu, Di; Liu, Weichen
author_sort: Li, Shiqing
title: Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs
publishDate: 2022
url: https://hdl.handle.net/10356/155570
_version_: 1729789512769339392