Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs
Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements in sparse matrices and their corresponding vector elements. Due to irregular data access patterns...
Saved in:
Main Authors: Li, Shiqing; Liu, Di; Liu, Weichen
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports
Online Access: https://hdl.handle.net/10356/155570
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-155570
record_format: dspace
spelling:
Authors: Li, Shiqing; Liu, Di; Liu, Weichen. School of Computer Science and Engineering.
Venue: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). HP-NTU Digital Manufacturing Corporate Lab.
Subjects: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports.
Funding: This work is partially supported by the Ministry of Education (MOE), Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
Deposited: 2022-03-07. Published: 2021. Type: Conference Paper.
Citation: Li, S., Liu, D. & Liu, W. (2021). Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). https://dx.doi.org/10.1109/ICCAD51958.2021.9643453
ISBN: 9781665445078. Handle: https://hdl.handle.net/10356/155570. DOI: 10.1109/ICCAD51958.2021.9643453. Scopus: 2-s2.0-85124164636. Dataset: 10.21979/N9/ATEYFB.
Rights: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICCAD51958.2021.9643453.
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Engineering::Computer science and engineering; Dataflow Engine; Memory Ports
description:
Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements in sparse matrices and their corresponding vector elements. Due to irregular data access patterns of vector elements and the limited memory bandwidth, the computational throughput of CPUs and GPUs is lower than the peak performance offered by FPGAs. An FPGA's large on-chip memory allows the input vector to be buffered on-chip, and hence the off-chip memory bandwidth is only utilized to transfer the nonzero elements' values, column indices, and row indices. Multiple nonzero elements are transmitted to the FPGA, and then their corresponding vector elements are accessed per cycle. However, typical on-chip block RAMs (BRAMs) in FPGAs only have two access ports. The mismatch between off-chip memory bandwidth and on-chip memory ports stalls the whole engine, resulting in inefficient utilization of off-chip memory bandwidth. In this work, we reorder the nonzero elements to optimize data reuse for SpMV on FPGAs. The key observation is that since a vector element can be reused by all nonzero elements with the same column index, the memory requests of these elements can be omitted by reusing the fetched data. Based on this observation, a novel compressed format is proposed to optimize data reuse by reordering the matrix's nonzero elements. Further, to support the compressed format, we design a scalable hardware accelerator and implement it on the Xilinx UltraScale ZCU106 platform. We evaluate the proposed design with a set of matrices from the University of Florida sparse matrix collection. The experimental results show that the proposed design achieves an average 1.22x performance speedup over the state-of-the-art work.
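The abstract's key observation — nonzero elements sharing a column index can reuse the same fetched vector element once they are made adjacent — can be sketched with a toy model. This is an illustrative sketch only (plain Python, with a hypothetical one-entry reuse buffer standing in for a fetched BRAM word); it is not the paper's actual compressed format or accelerator:

```python
# Sketch: COO-style SpMV that counts vector-element fetches, assuming only
# the most recently fetched x[col] can be reused (a one-entry reuse buffer).
# Reordering nonzeros by column index makes equal columns adjacent, so the
# fetch count drops while the result vector y is unchanged.

def spmv_coo(rows, cols, vals, x, n_rows):
    """Compute y = A @ x from COO triples; also count x-fetches."""
    y = [0.0] * n_rows
    fetches = 0
    last_col = None
    for r, c, v in zip(rows, cols, vals):
        if c != last_col:      # new column index forces a memory request
            fetches += 1
            last_col = c
        y[r] += v * x[c]       # reuse the already-fetched x[c] otherwise
    return y, fetches

# 4x4 sparse matrix in row-major COO order: column indices repeat,
# but non-adjacently, so the one-entry buffer gives no reuse.
rows = [0, 0, 1, 2, 2, 3]
cols = [0, 2, 1, 0, 2, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0]

y1, f1 = spmv_coo(rows, cols, vals, x, 4)

# Reorder nonzeros so equal column indices become adjacent.
order = sorted(range(len(vals)), key=lambda i: cols[i])
y2, f2 = spmv_coo([rows[i] for i in order],
                  [cols[i] for i in order],
                  [vals[i] for i in order], x, 4)

assert y1 == y2        # reordering does not change the product
print(f1, f2)          # row-major order: 6 fetches; column-sorted: 3
```

The real design must also track row indices of the reordered nonzeros (hence the proposed compressed format) and accumulate partial sums out of order, but the fetch-count arithmetic above is the core of the bandwidth saving.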
author2: School of Computer Science and Engineering
format: Conference or Workshop Item
author: Li, Shiqing; Liu, Di; Liu, Weichen
author_sort: Li, Shiqing
title: Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs
publishDate: 2022
url: https://hdl.handle.net/10356/155570
_version_: 1729789512769339392