An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning
Long short-term memory (LSTM) networks have been widely used in natural language processing applications. Although over 80% of the weights can be pruned to reduce the memory requirement with little accuracy loss, the pruned model still cannot be buffered on-chip on small embedded FPGAs. Since the weights are stored in off-chip DDR memory, the performance of LSTM is bounded by the available memory bandwidth. However, current pruning strategies do not consider bandwidth utilization and thus lead to poor performance in this situation. In this work, we propose an efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. The key idea is that a data sequence can be compressed if its items can be represented by a linear function of their indices in the sequence. Inspired by this idea, we first propose a column-wise pruning strategy that removes all the column indices and around 75% of the row indices of the remaining weights. Based on this strategy, we design a dedicated compressed format to fill the bandwidth. Further, we propose a fully pipelined hardware accelerator, which achieves workload balance and shortens the critical path. Finally, we train the LSTM model using the TIMIT dataset and implement the accelerator on the Xilinx PYNQ-Z1 platform. The experimental results show that our design achieves around 0.3% accuracy improvement, a 2.18x performance speedup, and a 1.96x power-efficiency improvement over the state-of-the-art work.
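For intuition, here is a minimal NumPy sketch of the column-wise pruning idea from the abstract: if every column keeps exactly the same number of nonzeros, the offset of column j in the flattened value stream is j * keep_per_col, a linear function of j, so no column indices need to be stored. The function names, the `keep_per_col` parameter, and the dense row-index array are illustrative assumptions; the paper's dedicated compressed format (which additionally elides around 75% of the row indices) is more elaborate than this sketch.

```python
import numpy as np

def columnwise_prune(W, keep_per_col):
    """Keep the keep_per_col largest-magnitude weights in each column of W.

    Since every column keeps the same number of nonzeros, column j's data
    starts at offset j * keep_per_col in the value stream (a linear function
    of j), so explicit column indices/pointers are unnecessary.
    """
    rows, cols = W.shape
    values = np.zeros((cols, keep_per_col), dtype=W.dtype)
    row_idx = np.zeros((cols, keep_per_col), dtype=np.int32)
    for j in range(cols):
        # Indices of the keep_per_col largest-magnitude entries, in row order.
        keep = np.sort(np.argsort(-np.abs(W[:, j]))[:keep_per_col])
        row_idx[j] = keep
        values[j] = W[keep, j]
    return values, row_idx

def spmv_pruned(values, row_idx, x, n_rows):
    """Compute y = W_pruned @ x from the index-free column layout."""
    cols, k = values.shape
    y = np.zeros(n_rows, dtype=values.dtype)
    for j in range(cols):          # column j starts at j * k in the stream
        for t in range(k):
            y[row_idx[j, t]] += values[j, t] * x[j]
    return y

# Toy usage: prune an 8x6 weight matrix to 2 nonzeros per column.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
vals, rows = columnwise_prune(W, keep_per_col=2)
y = spmv_pruned(vals, rows, rng.standard_normal(6), n_rows=8)
```

A fixed nonzero budget per column also gives every processing element the same amount of work per column, which is one way to read the abstract's workload-balance claim.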
Main Authors: Li, Shiqing; Zhu, Shien; Luo, Xiangzhong; Luo, Tao; Liu, Weichen
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2023
Subjects: Engineering::Computer science and engineering; Engineering::Computer science and engineering::Hardware; Sparse LSTM; Pruning; Bandwidth
Online Access: https://hdl.handle.net/10356/172603
Institution: Nanyang Technological University
Conference: 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)
Citation: Li, S., Zhu, S., Luo, X., Luo, T. & Liu, W. (2023). An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 42-48. https://dx.doi.org/10.1109/FPL60245.2023.00014
DOI: 10.1109/FPL60245.2023.00014
ISBN: 9798350341515
Scopus ID: 2-s2.0-85178188368
Version: Submitted/Accepted version
Funding: This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071), and by Nanyang Technological University, Singapore, under its NAP (M4082282/04INS000515C130).
Related Dataset DOI: 10.21979/N9/MTHKVG
Rights: © 2023 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1109/FPL60245.2023.00014.