Towards faster inference of transformers: Strategies for accelerating decoding processes

This thesis delves into the acceleration and optimization of Transformer inference, a subject of increasing importance with the emergence of Large Language Models (LLMs). The study addresses the challenges posed by two inherent properties of Transformers during inference: the quadratic complexity of the attention mechanism and the sequential nature of autoregressive inference. The research is structured into three main parts. The first part enhances the learning capabilities of non-autoregressive Transformers, achieving a 15.0x acceleration on machine translation tasks. The second part focuses on lossless acceleration through speculative decoding, where the proposed algorithm, Glide with CAPE, accelerates 33-billion-parameter LLMs by approximately 2.5 times. The third part reduces the complexity of the attention mechanism to a constant level through a Markov autoregressive Transformer, without significantly compromising model performance. Together, these contributions tackle the computational challenges of Transformer models and pave the way for more efficient deployment of LLMs in real-world applications.
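A note on the second contribution: the abstract names speculative decoding as the thesis's route to lossless acceleration. As a rough illustration only, here is a minimal Python sketch of the generic draft-and-verify idea, not the thesis's Glide with CAPE algorithm; the toy models, function names, and greedy acceptance rule are illustrative assumptions, and a real implementation would verify all k draft positions in a single batched forward pass of the target model.

    from typing import Callable, List

    # A "model" here is just a greedy next-token function over a token prefix.
    Model = Callable[[List[int]], int]

    def speculative_decode(target: Model, draft: Model,
                           prompt: List[int], k: int, max_new: int) -> List[int]:
        """Greedy speculative decoding: a cheap draft model proposes k tokens,
        and the expensive target model keeps the prefix it agrees with."""
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new:
            # Draft phase: propose k tokens autoregressively with the small model.
            proposal, ctx = [], list(tokens)
            for _ in range(k):
                nxt = draft(ctx)
                proposal.append(nxt)
                ctx.append(nxt)
            # Verify phase: keep draft tokens while the target agrees; on the
            # first mismatch, keep the target's own token, so the output is
            # exactly what target-only decoding would produce (lossless).
            accepted = []
            for i in range(k):
                expected = target(tokens + accepted)
                if expected == proposal[i]:
                    accepted.append(proposal[i])
                else:
                    accepted.append(expected)
                    break
            tokens.extend(accepted)
        return tokens[:len(prompt) + max_new]

    # Toy stand-ins: the target counts mod 10; the draft agrees except at one
    # value, so most (but not all) proposals are accepted.
    def toy_target(prefix: List[int]) -> int:
        return (prefix[-1] + 1) % 10

    def toy_draft(prefix: List[int]) -> int:
        nxt = (prefix[-1] + 1) % 10
        return nxt if nxt != 5 else 0  # deliberately wrong when the answer is 5

    print(speculative_decode(toy_target, toy_draft, prompt=[0], k=4, max_new=12))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]: the same tokens target-only
    # decoding yields; in a real system the k verification calls collapse into
    # one batched target forward pass, which is where the speedup comes from.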

Bibliographic Details
Main Author: DU, Cunxiao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Large Language Model; Neural Network; Language Processing; General AI; Artificial Intelligence and Robotics; Programming Languages and Compilers
Online Access: https://ink.library.smu.edu.sg/etd_coll/613
https://ink.library.smu.edu.sg/context/etd_coll/article/1611/viewcontent/GPIS_AY2019_PhD_CunxiaoDu.pdf
Institution: Singapore Management University
Collection: Dissertations and Theses Collection (Open Access)
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)