Hardware-constrained edge deep learning



Bibliographic Details
Main Author: Ng, Jia Rui
Other Authors: Weichen Liu
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Online Access: https://hdl.handle.net/10356/181190
Description
Abstract: Neural networks have become commonplace in our daily lives, powering everything from language models in chatbots to computer vision models in industrial machinery. The unending quest for greater model performance has led to an exponential growth in model size. For many devices, especially edge devices, storing or even running these models in a performant manner proves to be a challenge. In this paper, various memory compression methods, centered around post-training quantization, are explored for Large Language Models (LLMs) by comparing accuracy (perplexity) and inference latency (token generation speed). The report concludes that most LLMs can be quantized significantly without an observable loss in accuracy. However, very aggressive quantization (≤3 bits) can lead to rambling responses and a significant degradation in user experience. Further work can also be done to explore kernel-level quantization for convolutional neural networks and pseudo-vectorization for embedded use cases.
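The trade-off the abstract describes can be illustrated with a minimal sketch of post-training quantization. The snippet below (an assumption for illustration, not the project's actual method; the function names and the symmetric per-tensor scheme are hypothetical) rounds float weights to a small signed-integer grid and shows how reconstruction error grows as the bit width shrinks toward 3 bits:

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization: map floats onto a signed-integer grid."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 3 for 3 bits
    scale = np.abs(weights).max() / qmax  # one scale factor for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer grid."""
    return q.astype(np.float32) * scale

# Toy weight tensor standing in for one LLM layer
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)

q8, s8 = quantize(w, bits=8)
q3, s3 = quantize(w, bits=3)

err8 = np.abs(w - dequantize(q8, s8)).mean()
err3 = np.abs(w - dequantize(q3, s3)).mean()
print(f"mean abs reconstruction error  8-bit: {err8:.6f}   3-bit: {err3:.6f}")
```

At 8 bits the rounding error is typically negligible relative to the weight magnitudes, which is consistent with the report's finding that moderate quantization costs little perplexity; at 3 bits the grid has only seven levels, and the much larger error hints at why such aggressive settings degrade generation quality.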