Towards faster inference of transformers: Strategies for accelerating decoding processes
This thesis examines the acceleration and optimization of Transformer inference, a subject of growing importance with the emergence of Large Language Models (LLMs). It addresses the two inference bottlenecks inherent to Transformers: the quadratic complexity of the attention mechanism and the sequential nature of autoregressive decoding. The research is structured into three parts. The first part improves the learning capabilities of non-autoregressive Transformers, achieving a 15.0x speedup on machine translation tasks. The second part targets lossless acceleration through speculative decoding: the proposed algorithm, Glide with CAPE, accelerates 33-billion-parameter LLMs by approximately 2.5 times. The third part reduces the complexity of the attention mechanism to a constant level via a Markov autoregressive Transformer, without significantly compromising model performance. Together, these contributions tackle the computational challenges of Transformer models and pave the way for more efficient deployment of LLMs in real-world applications.
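The second contribution above builds on speculative decoding, in which a small draft model proposes several tokens and the large target model verifies them in one pass. The sketch below is a minimal illustration of that generic draft-then-verify loop only, not the thesis's Glide with CAPE algorithm; `VOCAB`, `toy_dist`, `draft_dist`, and `target_dist` are invented placeholders standing in for real Transformer forward passes.

```python
# Minimal, self-contained sketch of generic speculative decoding.
# All models here are toys; a real system would run Transformers.
import random

VOCAB = list(range(8))  # toy vocabulary of 8 token ids


def toy_dist(bias, context):
    """Hypothetical stand-in for a model's next-token distribution.
    Derives a deterministic distribution from the last context token."""
    rng = random.Random(hash((bias, context[-1] if context else -1)))
    weights = [rng.random() + bias * (tok % 2) for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]


def draft_dist(context):   # small, cheap "draft" model
    return toy_dist(0.5, context)


def target_dist(context):  # large, expensive "target" model
    return toy_dist(1.5, context)


def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]


def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target model.
    The accept/reject rule below is the standard one that makes the output
    follow the target distribution exactly, i.e. decoding stays lossless."""
    # 1) Draft phase: the cheap model proposes k tokens autoregressively.
    drafted, draft_probs, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = sample(q)
        drafted.append(tok)
        draft_probs.append(q)
        ctx.append(tok)

    # 2) Verify phase: in a real system one batched target-model pass
    #    scores all k positions at once; that is where the speedup comes from.
    accepted, ctx = [], list(context)
    for tok, q in zip(drafted, draft_probs):
        p = target_dist(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # token survives verification
            ctx.append(tok)
        else:
            # Rejected: resample from the normalized residual max(0, p - q),
            # then stop extending this draft.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual)
            accepted.append(sample([r / z for r in residual]) if z > 0 else sample(p))
            break
    # (The usual "bonus token" sampled when all k drafts survive is omitted.)
    return accepted


print(speculative_step([0], k=4))
```

Because one expensive target-model pass can validate several cheaply drafted tokens at once, multiple tokens are emitted per forward pass of the large model while the output distribution remains identical to ordinary autoregressive decoding.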
Main Author: | DU, Cunxiao |
---|---|
Format: | text (application/pdf) |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2024 |
Subjects: | Large Language Model; Neural Network; Language Processing; General AI; Artificial Intelligence and Robotics; Programming Languages and Compilers |
Online Access: | https://ink.library.smu.edu.sg/etd_coll/613 https://ink.library.smu.edu.sg/context/etd_coll/article/1611/viewcontent/GPIS_AY2019_PhD_CunxiaoDu.pdf |
License: | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Collection: | Dissertations and Theses Collection (Open Access), InK@SMU, SMU Libraries |
Institution: | Singapore Management University |