An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs

Sparse-matrix sparse-matrix multiplication (SpMM) is an important kernel in multiple areas, e.g., data analytics and machine learning. Due to the low on-chip memory requirement, the consistent data format, and the simplified control logic, the Gustavson’s algorithm is a promising backbone algorithm...

Full description

Saved in:

Bibliographic Details
Main Authors:	Li, Shiqing, Huai, Shuo, Liu, Weichen
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2023
Subjects:	Engineering::Computer science and engineering Engineering::Computer science and engineering::Hardware SpMM FPGA Gustavson Dataflow
Online Access:	https://hdl.handle.net/10356/169151
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

id	sg-ntu-dr.10356-169151
record_format	dspace
spelling	sg-ntu-dr.10356-1691512023-09-20T10:15:00Z An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs Li, Shiqing Huai, Shuo Liu, Weichen School of Computer Science and Engineering Engineering::Computer science and engineering Engineering::Computer science and engineering::Hardware SpMM FPGA Gustavson Dataflow Sparse-matrix sparse-matrix multiplication (SpMM) is an important kernel in multiple areas, e.g., data analytics and machine learning. Due to the low on-chip memory requirement, the consistent data format, and the simplified control logic, the Gustavson’s algorithm is a promising backbone algorithm for SpMM on hardware accelerators. However, the off-chip memory traffic still limits the performance of the algorithm, especially on embedded FPGAs. Previous researchers optimize the Gustavson’s algorithm targeting high bandwidth memory-based architectures and their solutions cannot be directly applied to embedded FPGAs with traditional DDRs. In this work, we propose an efficient Gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs. The proposed design fully considers the feature of off-chip memory access on embedded FPGAs and the dataflow of the Gustavson’s algorithm. At first, we analyze the parallelism of the algorithm and propose to perform the algorithm with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show a counter-intuitive example that the traditional cache leads to worse performance. Then, we propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Moreover, we propose to perform the merge on part of partial results, which removes some redundant merges in the naive implementation and has little postprocessing overhead. Finally, we conduct experiments on the Xilinx Zynq-UltraScale ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.75x performance speedup compared to the baseline. Ministry of Education (MOE) Nanyang Technological University Submitted/Accepted version This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071), and Nanyang Technological University, Singapore, under its NAP (M4082282/04INS000515C130). 2023-07-05T04:49:41Z 2023-07-05T04:49:41Z 2023 Journal Article Li, S., Huai, S. & Liu, W. (2023). An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs. IEEE Transactions On Computer-Aided Design of Integrated Circuits and Systems. https://dx.doi.org/10.1109/TCAD.2023.3281719 0278-0070 https://hdl.handle.net/10356/169151 10.1109/TCAD.2023.3281719 2-s2.0-85161016785 en MOE2019-T2-1-071 NAP (M4082282/04INS000515C130) IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 10.21979/N9/RKPHSM © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TCAD.2023.3281719. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering Engineering::Computer science and engineering::Hardware SpMM FPGA Gustavson Dataflow
spellingShingle	Engineering::Computer science and engineering Engineering::Computer science and engineering::Hardware SpMM FPGA Gustavson Dataflow Li, Shiqing Huai, Shuo Liu, Weichen An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
description	Sparse-matrix sparse-matrix multiplication (SpMM) is an important kernel in multiple areas, e.g., data analytics and machine learning. Due to the low on-chip memory requirement, the consistent data format, and the simplified control logic, the Gustavson’s algorithm is a promising backbone algorithm for SpMM on hardware accelerators. However, the off-chip memory traffic still limits the performance of the algorithm, especially on embedded FPGAs. Previous researchers optimize the Gustavson’s algorithm targeting high bandwidth memory-based architectures and their solutions cannot be directly applied to embedded FPGAs with traditional DDRs. In this work, we propose an efficient Gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs. The proposed design fully considers the feature of off-chip memory access on embedded FPGAs and the dataflow of the Gustavson’s algorithm. At first, we analyze the parallelism of the algorithm and propose to perform the algorithm with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show a counter-intuitive example that the traditional cache leads to worse performance. Then, we propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Moreover, we propose to perform the merge on part of partial results, which removes some redundant merges in the naive implementation and has little postprocessing overhead. Finally, we conduct experiments on the Xilinx Zynq-UltraScale ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.75x performance speedup compared to the baseline.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Li, Shiqing Huai, Shuo Liu, Weichen
format	Article
author	Li, Shiqing Huai, Shuo Liu, Weichen
author_sort	Li, Shiqing
title	An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
title_short	An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
title_full	An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
title_fullStr	An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
title_full_unstemmed	An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
title_sort	efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded fpgas
publishDate	2023
url	https://hdl.handle.net/10356/169151
_version_	1779156698694418432

An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs

Similar Items