Efficient inference offloading for mixture-of-experts large language models in internet of medical things

Despite recent significant advancements in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). People are increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs face difficulties in providing accurate medical question answering (Q&A) and in meeting the resource demands of deployment in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy that enables deployment in the IoMT and improves privacy protection for users. Additionally, we find that the significant factors affecting latency include the method of device interconnection, the location of the offloading servers, and the speed of the disk.
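
The offloading strategy described in the abstract — keeping only a few experts resident in fast memory and fetching the rest on demand when the router selects them — can be pictured with a minimal sketch. This is not the paper's MedMixtral implementation; the ExpertOffloader class, the cache size, and the load_fn loader below are hypothetical, assuming a simple LRU eviction policy over per-expert weights:

# Hypothetical sketch of MoE expert offloading (not the paper's code):
# only cache_size experts live in fast memory; the rest load on demand.
from collections import OrderedDict

class ExpertOffloader:
    def __init__(self, cache_size, load_fn):
        self.cache = OrderedDict()      # expert_id -> weights, in LRU order
        self.cache_size = cache_size    # experts that fit in RAM/VRAM
        self.load_fn = load_fn          # slow path: disk or offloading server

    def get_expert(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # mark as recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.cache_size:
            self.cache.popitem(last=False)       # evict least recently used
        weights = self.load_fn(expert_id)        # fetch from slow storage
        self.cache[expert_id] = weights
        return weights

# Usage: the MoE router picks experts per token; only those are fetched.
offloader = ExpertOffloader(cache_size=2,
                            load_fn=lambda i: f"weights-of-expert-{i}")
for expert_id in [0, 3, 0, 5]:    # example router decisions
    print(offloader.get_expert(expert_id))

Under such a scheme, cache-miss latency is dominated by the load path, which is consistent with the abstract's finding that the interconnection method, the offloading-server location, and disk speed are the main drivers of latency.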

Bibliographic Details
Main Authors: Yuan, Xiaoming, Kong, Weixuan, Luo, Zhenyu, Xu, Minrui
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2024
Subjects: Computer and Information Science; Large language models; Efficient inference offloading
Online Access: https://hdl.handle.net/10356/179743
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-179743
Citation: Yuan, X., Kong, W., Luo, Z. & Xu, M. (2024). Efficient inference offloading for mixture-of-experts large language models in internet of medical things. Electronics, 13(11), 2077. https://dx.doi.org/10.3390/electronics13112077
Journal: Electronics, volume 13, issue 11, article 2077 (2024)
ISSN: 2079-9292
DOI: 10.3390/electronics13112077
Scopus ID: 2-s2.0-85195785333
Version: Published version (application/pdf)
Funding: This research was supported in part by the National Natural Science Foundation of China (62371116), in part by the Science and Technology Project of Hebei Province Education Department (ZD2022164), and in part by the Project of Hebei Key Laboratory of Software Engineering (22567637H).
Rights: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).