An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning
Long short-term memory (LSTM) networks have been widely used in natural language processing applications. Although over 80% of the weights can be pruned to reduce the memory requirement with little accuracy loss, the pruned model still cannot be buffered on-chip on small embedded FPGAs. Since the weights are stored in off-chip DDR memory, the performance of LSTM is bounded by the available memory bandwidth. However, current pruning strategies do not consider bandwidth utilization and thus lead to poor performance in this situation. In this work, we propose an efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. The key idea is that a data sequence can be compressed if its items can be represented by a linear function of their indices in the sequence. Inspired by this idea, we first propose a column-wise pruning strategy that removes all the column indices and around 75% of the row indices of the remaining weights. Based on this strategy, we design a dedicated compressed format to fill the bandwidth. Further, we propose a fully pipelined hardware accelerator, which achieves workload balance and shortens the critical path. Finally, we train the LSTM model using the TIMIT dataset and implement the accelerator on the Xilinx PYNQ-Z1 platform. The experimental results show that our design achieves around 0.3% accuracy improvement, a 2.18x performance speedup, and a 1.96x power-efficiency improvement over the state-of-the-art work.
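For intuition, here is a minimal NumPy sketch of the column-wise pruning idea from the abstract: if every column keeps exactly the same number of nonzeros, the offset of column j in the flattened value stream is j * keep_per_col, a linear function of j, so no column indices need to be stored. The function names, the `keep_per_col` parameter, and the dense row-index array are illustrative assumptions; the paper's dedicated compressed format (which additionally elides around 75% of the row indices) is more elaborate than this sketch.

```python
import numpy as np

def columnwise_prune(W, keep_per_col):
    """Keep the keep_per_col largest-magnitude weights in each column of W.

    Since every column keeps the same number of nonzeros, column j's data
    starts at offset j * keep_per_col in the value stream (a linear function
    of j), so explicit column indices/pointers are unnecessary.
    """
    rows, cols = W.shape
    values = np.zeros((cols, keep_per_col), dtype=W.dtype)
    row_idx = np.zeros((cols, keep_per_col), dtype=np.int32)
    for j in range(cols):
        # Indices of the keep_per_col largest-magnitude entries, in row order.
        keep = np.sort(np.argsort(-np.abs(W[:, j]))[:keep_per_col])
        row_idx[j] = keep
        values[j] = W[keep, j]
    return values, row_idx

def spmv_pruned(values, row_idx, x, n_rows):
    """Compute y = W_pruned @ x from the index-free column layout."""
    cols, k = values.shape
    y = np.zeros(n_rows, dtype=values.dtype)
    for j in range(cols):          # column j starts at j * k in the stream
        for t in range(k):
            y[row_idx[j, t]] += values[j, t] * x[j]
    return y

# Toy usage: prune an 8x6 weight matrix to 2 nonzeros per column.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
vals, rows = columnwise_prune(W, keep_per_col=2)
y = spmv_pruned(vals, rows, rng.standard_normal(6), n_rows=8)
```

A fixed nonzero budget per column also gives every processing element the same amount of work per column, which is one way to read the abstract's workload-balance claim.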
Main Authors: Li, Shiqing; Zhu, Shien; Luo, Xiangzhong; Luo, Tao; Liu, Weichen
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2023
Subjects: Engineering::Computer science and engineering; Engineering::Computer science and engineering::Hardware; Sparse LSTM; Pruning; Bandwidth
Online Access: https://hdl.handle.net/10356/172603
Institution: Nanyang Technological University
Conference: 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)
Citation: Li, S., Zhu, S., Luo, X., Luo, T. & Liu, W. (2023). An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 42-48. https://dx.doi.org/10.1109/FPL60245.2023.00014
DOI: 10.1109/FPL60245.2023.00014
ISBN: 9798350341515
Scopus ID: 2-s2.0-85178188368
Version: Submitted/Accepted version
Funding: This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071), and by Nanyang Technological University, Singapore, under its NAP (M4082282/04INS000515C130).
Related Dataset DOI: 10.21979/N9/MTHKVG
Rights: © 2023 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1109/FPL60245.2023.00014.