An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning
Format: Conference or Workshop Item
Language: English
Published: 2023
Online Access: https://hdl.handle.net/10356/172603
Institution: Nanyang Technological University
Summary: Long short-term memory (LSTM) networks are widely used in natural language processing applications. Although over 80% of the weights can be pruned with little accuracy loss to reduce the memory requirement, the pruned model still cannot be buffered on-chip on small embedded FPGAs. With the weights stored in off-chip DDR memory, LSTM performance is bounded by the available memory bandwidth. However, current pruning strategies do not consider bandwidth utilization and therefore perform poorly in this setting. In this work, we propose an efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. The key idea is that a data sequence can be compressed if its items can be represented as a linear function of their indices in the sequence. Inspired by this idea, we first propose a column-wise pruning strategy that removes all the column indices and around 75% of the row indices of the remaining weights. Based on this strategy, we design a dedicated compressed format that fills the available bandwidth. Further, we propose a fully pipelined hardware accelerator that balances the workload and shortens the critical path. Finally, we train the LSTM model on the TIMIT dataset and implement the accelerator on the Xilinx PYNQ-Z1 platform. The experimental results show that our design achieves around 0.3% accuracy improvement, a 2.18x performance speedup, and a 1.96x power-efficiency improvement compared to the state-of-the-art work.
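To make the key idea concrete, the following Python sketch compresses a column's retained row indices into a linear rule when one exists, so two scalars can replace an explicit index list. The function name `compress_column`, the tuple layout, and the toy index patterns are assumptions for illustration only; they are not the paper's actual compressed format.

```python
import numpy as np

def compress_column(row_idx):
    """Try to replace an explicit index list with a linear rule.

    If the retained row indices satisfy row_idx[k] = a*k + b for all k,
    only the pair (a, b) needs to be stored instead of the whole list.
    (Illustrative sketch; storage layout is an assumption, not the
    paper's dedicated compressed format.)
    """
    k = np.arange(len(row_idx))
    a = row_idx[1] - row_idx[0] if len(row_idx) > 1 else 0
    b = row_idx[0]
    if np.array_equal(row_idx, a * k + b):
        return ("linear", a, b)          # two scalars replace the list
    return ("explicit", row_idx)         # fall back to explicit indices

# Toy example: column-wise pruning keeps the same number of nonzeros in
# every column; when the surviving rows are evenly spaced, each column's
# row indices collapse to a slope and an offset.
rows_col0 = np.array([0, 4, 8, 12])      # 4*k + 0 -> ("linear", 4, 0)
rows_col1 = np.array([1, 5, 9, 13])      # 4*k + 1 -> ("linear", 4, 1)
print(compress_column(rows_col0))
print(compress_column(rows_col1))
```

In this toy setting, no column indices are stored at all (every column keeps the same nonzero count, so column boundaries are implicit), and any column whose surviving rows follow a linear pattern needs no explicit row indices either, which mirrors how the abstract's strategy removes all column indices and most row indices.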