Inference acceleration of large language models

This dissertation delves into the challenges and bottlenecks faced by current large language models during inference from three core perspectives: data, model, and system. Through meticulous research, key factors impacting inference speed are identified, encompassing data processing efficiency, model structure complexity, and system resource allocation and utilization. Building on this foundation, I review and interpret previous research in this field, systematically summarizing their core ideas, implementation pathways, and achievements. By deeply analyzing these studies, I not only highlight their respective strengths and weaknesses but also propose targeted improvement suggestions in line with current technological trends.


Bibliographic Details
Main Author: Zhang, Boyu
Other Authors: Mao Kezhi
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Large language model; Quantization; Approximate computation; Self-attention; Transformer
Online Access: https://hdl.handle.net/10356/181660
id sg-ntu-dr.10356-181660
record_format dspace
spelling sg-ntu-dr.10356-1816602024-12-13T15:47:32Z Inference acceleration of large language models Zhang, Boyu Mao Kezhi School of Electrical and Electronic Engineering EKZMao@ntu.edu.sg Computer and Information Science Large language model Quantization Approximate computation Self-attention Transformer This dissertation delves into the challenges and bottlenecks faced by current large language models during inference from three core perspectives: data, model, and system. Through meticulous research, key factors impacting inference speed are identified, encompassing data processing efficiency, model structure complexity, and system resource allocation and utilization. Building on this foundation, I review and interpret previous research in this field, systematically summarizing their core ideas, implementation pathways, and achievements. By deeply analyzing these studies, I not only highlight their respective strengths and weaknesses but also propose targeted improvement suggestions in line with current technological trends. Master's degree 2024-12-12T02:30:18Z 2024-12-12T02:30:18Z 2024 Thesis-Master by Coursework Zhang, B. (2024). Inference acceleration of large language models. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181660 https://hdl.handle.net/10356/181660 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Large language model
Quantization
Approximate computation
Self-attention
Transformer
spellingShingle Computer and Information Science
Large language model
Quantization
Approximate computation
Self-attention
Transformer
Zhang, Boyu
Inference acceleration of large language models
description This dissertation delves into the challenges and bottlenecks faced by current large language models during inference from three core perspectives: data, model, and system. Through meticulous research, key factors impacting inference speed are identified, encompassing data processing efficiency, model structure complexity, and system resource allocation and utilization. Building on this foundation, I review and interpret previous research in this field, systematically summarizing their core ideas, implementation pathways, and achievements. By deeply analyzing these studies, I not only highlight their respective strengths and weaknesses but also propose targeted improvement suggestions in line with current technological trends.
author2 Mao Kezhi
author_facet Mao Kezhi
Zhang, Boyu
format Thesis-Master by Coursework
author Zhang, Boyu
author_sort Zhang, Boyu
title Inference acceleration of large language models
title_short Inference acceleration of large language models
title_full Inference acceleration of large language models
title_fullStr Inference acceleration of large language models
title_full_unstemmed Inference acceleration of large language models
title_sort inference acceleration of large language models
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/181660
_version_ 1819112985978732544