Inference acceleration of large language models

This dissertation examines the challenges and bottlenecks faced by current large language models during inference from three core perspectives: data, model, and system. Key factors affecting inference speed are identified, including data processing efficiency, model structure complexity, and system resource allocation and utilization. Building on this foundation, I review and interpret previous research in this field, systematically summarizing its core ideas, implementation pathways, and achievements. Through in-depth analysis of these studies, I not only highlight their respective strengths and weaknesses but also propose targeted improvement suggestions in line with current technological trends.


Bibliographic Details
Main Author: Zhang, Boyu
Other Authors: Mao Kezhi
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Large language model; Quantization; Approximate computation; Self-attention; Transformer
Online Access:https://hdl.handle.net/10356/181660
Institution: Nanyang Technological University
Thesis advisor: Mao Kezhi (EKZMao@ntu.edu.sg), School of Electrical and Electronic Engineering
Degree: Master's degree
Citation: Zhang, B. (2024). Inference acceleration of large language models. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181660
Deposited: 2024-12-12
File format: application/pdf
Collection: DR-NTU (NTU Library, Nanyang Technological University, Singapore)