Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs

Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements in sparse matrices and their corresponding vector elements. Due to the irregular data access patterns of vector elements and the limited memory bandwidth, the computational throughput of CPUs and GPUs is lower than the peak performance offered by FPGAs. An FPGA's large on-chip memory allows the input vector to be buffered on chip, so the off-chip memory bandwidth is used only to transfer the nonzero elements' values, column indices, and row indices. Multiple nonzero elements are transmitted to the FPGA and their corresponding vector elements are then accessed each cycle. However, typical on-chip block RAMs (BRAMs) in FPGAs have only two access ports. The mismatch between off-chip memory bandwidth and on-chip memory ports stalls the whole engine, resulting in inefficient utilization of the off-chip memory bandwidth. In this work, we reorder the nonzero elements to optimize data reuse for SpMV on FPGAs. The key observation is that, since a vector element can be reused by all nonzero elements sharing its column index, memory requests for these elements can be omitted by reusing the already fetched data. Based on this observation, a novel compressed format is proposed that optimizes data reuse by reordering the matrix's nonzero elements. Further, to support the compressed format, we design a scalable hardware accelerator and implement it on the Xilinx UltraScale ZCU106 platform. We evaluate the proposed design with a set of matrices from the University of Florida sparse matrix collection. The experimental results show that the proposed design achieves an average 1.22x speedup over the state-of-the-art work.
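
The reuse observation in the abstract can be illustrated with a small software model. The sketch below is an assumption only: it uses a plain COO triple list, a toy 3x3 matrix, and hypothetical helper names (spmv_coo, reorder_by_column); it is not the paper's compressed format or hardware accelerator. It merely counts how many vector-element fetches a streaming SpMV would issue when a fetched x[col] is reused while consecutive nonzeros share a column index, before and after reordering the nonzeros by column.

# Hypothetical software model of the column-reuse idea (not the paper's design).
def spmv_coo(rows, cols, vals, x, n_rows):
    """Reference COO SpMV y = A*x; also counts vector fetches, assuming a
    fetched x[col] is reused while the column index stays the same."""
    y = [0.0] * n_rows
    fetches = 0
    last_col = None
    for r, c, v in zip(rows, cols, vals):
        if c != last_col:           # new column index -> fetch x[c] again
            fetches += 1
            last_col = c
        y[r] += v * x[c]
    return y, fetches

def reorder_by_column(rows, cols, vals):
    """Stable-sort nonzeros by column index so equal columns become adjacent."""
    order = sorted(range(len(vals)), key=lambda i: cols[i])
    return ([rows[i] for i in order],
            [cols[i] for i in order],
            [vals[i] for i in order])

if __name__ == "__main__":
    # Toy 3x3 matrix in COO form (row, col, value), listed row by row.
    rows = [0, 0, 1, 2, 2]
    cols = [0, 2, 0, 2, 1]
    vals = [1.0, 2.0, 3.0, 4.0, 5.0]
    x = [1.0, 2.0, 3.0]

    y_ref, fetches_orig = spmv_coo(rows, cols, vals, x, 3)
    r2, c2, v2 = reorder_by_column(rows, cols, vals)
    y_new, fetches_reord = spmv_coo(r2, c2, v2, x, 3)

    assert y_ref == y_new               # reordering leaves the result unchanged (exact for this toy case)
    print(fetches_orig, fetches_reord)  # 5 fetches before reordering vs. 3 after

In this toy case the modeled fetch count drops from 5 to 3 while the result vector is unchanged, which is the effect the proposed reordering exploits to ease the pressure on the two BRAM access ports.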

Bibliographic Details
Main Authors: Li, Shiqing; Liu, Di; Liu, Weichen
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports
Online Access:https://hdl.handle.net/10356/155570
Institution: Nanyang Technological University
Conference: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Organisations: HP-NTU Digital Manufacturing Corporate Lab
Type: Conference Paper
Citation: Li, S., Liu, D. & Liu, W. (2021). Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). https://dx.doi.org/10.1109/ICCAD51958.2021.9643453
ISBN: 9781665445078
DOI: 10.1109/ICCAD51958.2021.9643453
Scopus ID: 2-s2.0-85124164636
Related DOI: 10.21979/N9/ATEYFB
Funding Agencies: Ministry of Education (MOE); Nanyang Technological University
Funding: This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
Rights: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICCAD51958.2021.9643453.
File Format: application/pdf