Inference acceleration of large language models

This dissertation examines the challenges and bottlenecks faced by current large language models during inference from three core perspectives: data, model, and system. Key factors affecting inference speed are identified, including data processing efficiency, model structure complexity, and system resource allocation and utilization. Building on this foundation, I review and interpret previous research in this field, systematically summarizing its core ideas, implementation pathways, and achievements. Through in-depth analysis of these studies, I not only highlight their respective strengths and weaknesses but also propose targeted improvement suggestions in line with current technological trends.


Bibliographic Details
Main Author: Zhang, Boyu
Other Authors: Mao Kezhi
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Large language model; Quantization; Approximate computation; Self-attention; Transformer
Online Access:https://hdl.handle.net/10356/181660
Institution: Nanyang Technological University
Thesis advisor: Mao Kezhi (EKZMao@ntu.edu.sg), School of Electrical and Electronic Engineering
Degree: Master's degree
Citation: Zhang, B. (2024). Inference acceleration of large language models. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181660
Deposited: 2024-12-12
File format: application/pdf
Collection: DR-NTU (NTU Library, Nanyang Technological University, Singapore)