Efficient FPGA-based sparse matrix-vector multiplication with data reuse-aware compression

Sparse matrix-vector multiplication (SpMV) on FPGAs has gained much attention. The performance of SpMV is mainly determined by the number of multiplications between non-zero matrix elements and the corresponding vector values per cycle. On the one side, the off-chip memory bandwidth limits the numbe...

全面介紹

Saved in:
書目詳細資料
Main Authors: Li, Shiqing, Liu, Di, Liu, Weichen
其他作者: School of Computer Science and Engineering
格式: Article
語言:English
出版: 2023
主題:
在線閱讀:https://hdl.handle.net/10356/169152
標簽: 添加標簽
沒有標簽, 成為第一個標記此記錄!
實物特徵
總結:Sparse matrix-vector multiplication (SpMV) on FPGAs has gained much attention. The performance of SpMV is mainly determined by the number of multiplications between non-zero matrix elements and the corresponding vector values per cycle. On the one side, the off-chip memory bandwidth limits the number of non-zero matrix elements transferred from the off-chip DDR to the FPGA chip per cycle. On the other side, the irregular vector access pattern poses challenges to fetch the corresponding vector values. Besides, the read-after-write (RAW) dependency in the accumulation process shall be solved to enable a fully pipelined design. In this work, we propose an efficient FPGA-based sparse matrix-vector multiplication accelerator with data reuse-aware compression. The key observation is that repeated accesses to a vector value can be omitted by reusing the fetched data. Based on the observation, we propose a reordering algorithm to manually exploit the data reuse of fetched vector values. Further, we propose a novel compressed format called data reuse-aware compressed (DRC) to take full advantage of the data reuse and a fast format conversion algorithm to shorten the preprocessing time. Meanwhile, we propose an HLSfriendly accumulator to solve the RAW dependency. Finally, we implement and evaluate our proposed design on the Xilinx Zynq-UltraScale ZCU106 platform with a set of sparse matrices from the SuiteSparse matrix collection. Our proposed design achieves an average 1.18x performance speedup without the DRC format and an average 1.57x performance speedup with the DRC format w.r.t. the state-of-the-art work respectively.