Efficient inference offloading for mixture-of-experts large language models in Internet of Medical Things


Bibliographic Details
Main Authors: Yuan, Xiaoming; Kong, Weixuan; Luo, Zhenyu; Xu, Minrui
Other Authors: School of Computer Science and Engineering
Format: Article
Language:English
Published: 2024
Online Access:https://hdl.handle.net/10356/179743
Institution: Nanyang Technological University
Description
Summary: Despite recent significant advances in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). People are increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs struggle to provide accurate medical question answering (Q&A) and to meet the resource demands of deployment in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, which enables deployment in the IoMT and improves privacy protection for users. Additionally, we find that the significant factors affecting latency include the device interconnection method, the location of the offloading servers, and disk speed.
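
The record does not spell out the offloading strategy, but the general idea behind MoE expert offloading can be illustrated with a minimal Python sketch: keep only the experts selected by the router resident on the accelerator, and page the remaining experts in from host memory or disk on demand. All names here (ExpertCache, expert_paths, the use of torch.load on per-expert checkpoints) are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of MoE expert offloading, assuming each expert is saved
# as its own torch checkpoint. Disk speed dominates the miss path, which
# matches the latency factor highlighted in the abstract.
from collections import OrderedDict
import torch

class ExpertCache:
    """LRU cache keeping a bounded number of MoE experts on the device."""

    def __init__(self, expert_paths: dict[int, str], capacity: int,
                 device: str = "cuda"):
        self.expert_paths = expert_paths   # expert id -> checkpoint path (assumed layout)
        self.capacity = capacity           # max experts resident on the device
        self.device = device
        self.cache: OrderedDict[int, torch.nn.Module] = OrderedDict()

    def _load_expert(self, expert_id: int) -> torch.nn.Module:
        # Fetch the expert from disk into host memory, then move it to the device.
        expert = torch.load(self.expert_paths[expert_id], map_location="cpu")
        return expert.to(self.device)

    def get(self, expert_id: int) -> torch.nn.Module:
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)        # mark as most recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.capacity:
            _, evicted = self.cache.popitem(last=False)  # evict the LRU expert
            evicted.to("cpu")                            # offload it back to host memory
        self.cache[expert_id] = self._load_expert(expert_id)
        return self.cache[expert_id]

# During inference, the router's top-k expert ids would drive the cache,
# e.g. (hypothetical router interface):
#   output = sum(w * cache.get(eid)(x) for eid, w in router(x))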