Towards faster inference of transformers: Strategies for accelerating decoding processes

This thesis delves into the acceleration and optimization of Transformer inference, a subject of increasing importance with the emergence of Large Language Models (LLMs). The study addresses the challenges posed by two inherent properties of Transformers during inference: the quadratic complexity of the attention mechanism and the sequential nature of autoregressive inference. The research is structured into three main parts. The first part enhances the learning capabilities of non-autoregressive Transformers, achieving a 15.0x acceleration on machine translation tasks. The second part focuses on lossless acceleration through speculative decoding, where the proposed algorithm, Glide with CAPE, accelerates 33-billion-parameter LLMs by approximately 2.5 times. The third part reduces the complexity of the attention mechanism to a constant level through a Markov autoregressive Transformer, without significantly compromising model performance. Together, these contributions tackle the computational challenges of Transformer models and pave the way for more efficient deployment of LLMs in real-world applications.
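A note on the second contribution: the abstract names speculative decoding as the thesis's route to lossless acceleration. As a rough illustration only, here is a minimal Python sketch of the generic draft-and-verify idea, not the thesis's Glide with CAPE algorithm; the toy models, function names, and greedy acceptance rule are illustrative assumptions, and a real implementation would verify all k draft positions in a single batched forward pass of the target model.

    from typing import Callable, List

    # A "model" here is just a greedy next-token function over a token prefix.
    Model = Callable[[List[int]], int]

    def speculative_decode(target: Model, draft: Model,
                           prompt: List[int], k: int, max_new: int) -> List[int]:
        """Greedy speculative decoding: a cheap draft model proposes k tokens,
        and the expensive target model keeps the prefix it agrees with."""
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new:
            # Draft phase: propose k tokens autoregressively with the small model.
            proposal, ctx = [], list(tokens)
            for _ in range(k):
                nxt = draft(ctx)
                proposal.append(nxt)
                ctx.append(nxt)
            # Verify phase: keep draft tokens while the target agrees; on the
            # first mismatch, keep the target's own token, so the output is
            # exactly what target-only decoding would produce (lossless).
            accepted = []
            for i in range(k):
                expected = target(tokens + accepted)
                if expected == proposal[i]:
                    accepted.append(proposal[i])
                else:
                    accepted.append(expected)
                    break
            tokens.extend(accepted)
        return tokens[:len(prompt) + max_new]

    # Toy stand-ins: the target counts mod 10; the draft agrees except at one
    # value, so most (but not all) proposals are accepted.
    def toy_target(prefix: List[int]) -> int:
        return (prefix[-1] + 1) % 10

    def toy_draft(prefix: List[int]) -> int:
        nxt = (prefix[-1] + 1) % 10
        return nxt if nxt != 5 else 0  # deliberately wrong when the answer is 5

    print(speculative_decode(toy_target, toy_draft, prompt=[0], k=4, max_new=12))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]: the same tokens target-only
    # decoding yields; in a real system the k verification calls collapse into
    # one batched target forward pass, which is where the speedup comes from.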

Bibliographic Details
Main Author: DU, Cunxiao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Large Language Model; Neural Network; Language Processing; General AI; Artificial Intelligence and Robotics; Programming Languages and Compilers
Online Access: https://ink.library.smu.edu.sg/etd_coll/613
https://ink.library.smu.edu.sg/context/etd_coll/article/1611/viewcontent/GPIS_AY2019_PhD_CunxiaoDu.pdf
Institution: Singapore Management University
Collection: Dissertations and Theses Collection (Open Access)
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)