TAB : unified and optimized ternary, binary and mixed-precision neural network inference on the edge

Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy compared to Binary Neural Networks (BNNs) while providing fast, low-power and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs, but overlooked thei...

Bibliographic Details
Main Authors:	Zhu, Shien; Duong, Luan H. K.; Liu, Weichen
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2022
Subjects:	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computer systems organization::Performance of systems; Edge Computing; Ternary Neural Networks; Binary Neural Networks
Online Access:	https://hdl.handle.net/10356/155648 https://doi.org/10.21979/N9/RZ75BY
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-155648
record_format	dspace
spelling	sg-ntu-dr.10356-155648 2022-05-11T06:54:50Z Ministry of Education (MOE) Nanyang Technological University Submitted/Accepted version This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087). 2022-03-16T06:33:44Z 2022-03-16T06:33:44Z 2022 Journal Article Zhu, S., Duong, L. H. K. & Liu, W. (2022). TAB : unified and optimized ternary, binary and mixed-precision neural network inference on the edge. ACM Transactions on Embedded Computing Systems. https://dx.doi.org/10.1145/3508390 1539-9087 https://hdl.handle.net/10356/155648 10.1145/3508390 en MOE2019-T2-1-071 MOE2019-T1-001-072 M4082282 M4082087 ACM Transactions on Embedded Computing Systems https://doi.org/10.21979/N9/RZ75BY © 2022 Association for Computing Machinery. All rights reserved. This paper was published in ACM Transactions on Embedded Computing Systems and is made available with permission of Association for Computing Machinery. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computer systems organization::Performance of systems; Edge Computing; Ternary Neural Networks; Binary Neural Networks
description	Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy than Binary Neural Networks (BNNs) while providing fast, low-power and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs, but overlooked their optimization on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integers, resulting in bit-extraction overhead. Last, adopting standard 2-bit multiplications for ternary values leads to a complex computation pipeline, and efficient mixed-precision multiplication between ternary and binary values is unavailable. In this paper, we propose TAB as a unified and optimized inference method for ternary, binary and mixed-precision neural networks. TAB includes a unified value representation, an efficient data storage scheme, and novel bitwise dot product pipelines on CPU/GPU platforms. We adopt signed integers for consistent value representation across binary and ternary values. We introduce a bitwidth-last data format that stores the first and second bits of the ternary values separately to remove the bit-extraction overhead. We design the ternary and binary bitwise dot product pipelines based on Gated-XOR, using up to 40% fewer operations than State-Of-The-Art (SOTA) methods. Theoretical speedup analysis shows that our proposed TAB-TNN is 2.3X as fast as the SOTA ternary method RTN, 9.8X as fast as 8-bit integer quantization (INT8), and 39.4X as fast as 32-bit full-precision convolution (FP32). Experimental results on CPU and GPU platforms show that TAB-TNN achieves up to 34.6X speedup and 16X storage size reduction compared with FP32 layers. TBN, Binary-activation Ternary-weight Network (BTN) and BNN in TAB are up to 40.7X, 56.2X and 72.2X as fast as FP32, respectively. TAB-TNN is up to 70.1% faster and 12.8% more power-efficient than RTN on Darknet-19 while keeping the same accuracy. TAB is open source as a PyTorch Extension for easy integration with existing CNN models.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering; Zhu, Shien; Duong, Luan H. K.; Liu, Weichen
format	Article
author	Zhu, Shien; Duong, Luan H. K.; Liu, Weichen
author_sort	Zhu, Shien
title	TAB : unified and optimized ternary, binary and mixed-precision neural network inference on the edge
publishDate	2022
url	https://hdl.handle.net/10356/155648 https://doi.org/10.21979/N9/RZ75BY
_version_	1734310191766175744
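
The description above mentions a signed-integer encoding, a bitwidth-last layout that keeps the two bits of each ternary value in separate words, and a gated bitwise dot product. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes the 2-bit two's-complement codes +1 = 01, 0 = 00, -1 = 11, and the names `pack_ternary`, `popcount` and `ternary_dot` are hypothetical. The exact Gated-XOR pipeline in the paper may differ.

```python
# Illustrative sketch only; the encoding and function names are assumptions,
# not taken from the TAB source code.

def pack_ternary(values):
    """Pack values from {-1, 0, +1} into two bit planes (bitwidth-last style).

    Assumed 2-bit two's-complement codes: +1 = 01, 0 = 00, -1 = 11.
    The high bits go into sign_plane, the low bits into nonzero_plane,
    so bit i of each plane belongs to values[i].
    """
    sign_plane = 0
    nonzero_plane = 0
    for i, v in enumerate(values):
        if v not in (-1, 0, 1):
            raise ValueError(f"not a ternary value: {v!r}")
        if v != 0:
            nonzero_plane |= 1 << i
        if v < 0:
            sign_plane |= 1 << i
    return sign_plane, nonzero_plane


def popcount(word):
    """Count the set bits of a non-negative integer."""
    return bin(word).count("1")


def ternary_dot(a, b):
    """Dot product of two packed vectors using only bitwise ops and popcount.

    XOR of the sign planes marks positions whose product is negative;
    AND of the nonzero planes gates out positions where either operand is 0.
    """
    sign_a, nonzero_a = a
    sign_b, nonzero_b = b
    active = nonzero_a & nonzero_b          # positions with a nonzero product
    negative = (sign_a ^ sign_b) & active   # active positions with product -1
    return popcount(active) - 2 * popcount(negative)


if __name__ == "__main__":
    activations = [1, -1, 0, 1, -1, 0, 1, 1]
    weights = [1, 1, -1, 0, -1, 1, 1, 0]
    packed = ternary_dot(pack_ternary(activations), pack_ternary(weights))
    reference = sum(x * w for x, w in zip(activations, weights))
    print(packed, reference)
```

Binary vectors with values in {-1, +1} pack through the same function (their nonzero plane is all ones), so in this sketch one routine covers the ternary, binary and mixed-precision cases. Each ternary value occupies 2 bits under this layout, against 32 bits for an FP32 value, which is consistent with the 16X storage reduction quoted in the description.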

Similar Items