An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning

Long short-term memory (LSTM) networks have been widely used in natural language processing applications. Although over 80% of the weights can be pruned to reduce the memory requirement with little accuracy loss, the pruned model still cannot be buffered on-chip on small embedded FPGAs. Since the weights must be stored in off-chip DDR memory, the performance of the LSTM is bounded by the available memory bandwidth. However, existing pruning strategies do not consider bandwidth utilization and thus perform poorly in this situation. In this work, we propose an efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. The key idea is that a data sequence can be compressed if its items can be represented as a linear function of their indices in the sequence. Inspired by this idea, we first propose a column-wise pruning strategy that removes all the column indices and around 75% of the row indices of the remaining weights. Based on this strategy, we design a dedicated compressed format to fill the available bandwidth. Further, we propose a fully pipelined hardware accelerator that achieves workload balance and shortens the critical path. Finally, we train the LSTM model on the TIMIT dataset and implement the accelerator on the Xilinx PYNQ-Z1 platform. The experimental results show that our design achieves around a 0.3% accuracy improvement, a 2.18x performance speedup, and a 1.96x power efficiency improvement compared to the state-of-the-art work.
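The record does not reproduce the paper's compressed format, but the abstract's key idea (a sequence whose items are a linear function of their indices needs no explicit per-item indices) can be illustrated briefly. The Python sketch below is a hypothetical encoding written for this record, not the authors' actual format:

# Hypothetical sketch, not the paper's format: nonzero entries whose row
# indices form an arithmetic sequence (a linear function of position) can
# be stored as (start, step, values) instead of one explicit index per
# value, cutting the index overhead that competes with weight data for
# off-chip memory bandwidth.

def compress_indices(indices, values):
    """Encode (indices, values) as (start, step, values) if the indices
    satisfy index(i) = start + step * i; otherwise return None."""
    if len(indices) < 2:
        return None
    step = indices[1] - indices[0]
    if all(indices[i] == indices[0] + step * i for i in range(len(indices))):
        return indices[0], step, list(values)
    return None

def decompress_indices(start, step, values):
    """Recover the explicit (index, value) pairs."""
    return [(start + step * i, v) for i, v in enumerate(values)]

# Example: row indices 2, 5, 8, 11 fit index(i) = 2 + 3*i, so only the
# pair (2, 3) is stored alongside the four weight values.
packed = compress_indices([2, 5, 8, 11], [0.5, -1.2, 0.7, 0.3])
assert packed == (2, 3, [0.5, -1.2, 0.7, 0.3])
assert decompress_indices(*packed)[0] == (2, 0.5)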

Bibliographic Details
Main Authors: Li, Shiqing, Zhu, Shien, Luo, Xiangzhong, Luo, Tao, Liu, Weichen
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2023
Subjects: Engineering::Computer science and engineering; Engineering::Computer science and engineering::Hardware; Sparse LSTM; Pruning; Bandwidth
Online Access: https://hdl.handle.net/10356/172603
Institution: Nanyang Technological University
Published in: 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), pp. 42-48
Citation: Li, S., Zhu, S., Luo, X., Luo, T. & Liu, W. (2023). An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 42-48. https://dx.doi.org/10.1109/FPL60245.2023.00014
DOI: 10.1109/FPL60245.2023.00014
ISBN: 9798350341515
Research Data: 10.21979/N9/MTHKVG
Version: Submitted/Accepted version
Funding: This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071), and Nanyang Technological University, Singapore, under its NAP (M4082282/04INS000515C130).
Rights: © 2023 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1109/FPL60245.2023.00014.