Efficient inference offloading for mixture-of-experts large language models in internet of medical things
Despite recent significant advancements in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). People are increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs struggle both to provide accurate medical question answering (Q&A) and to meet the resource demands of deployment in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, which enables deployment on the IoMT and improves privacy protection for users. Additionally, we find that the significant factors affecting latency include the method of device interconnection, the location of the offloading servers, and the speed of the disk.
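The record carries no implementation details beyond the abstract, but the offloading idea it describes can be sketched. The following minimal Python/PyTorch example is an illustrative assumption, not the authors' MedMixtral 8x7B code: class names, sizes, and the use of CPU as a stand-in for the slow storage tier are all invented here. It shows why MoE inference pairs naturally with offloading: the router activates only a few experts per token, so only those experts' weights must be fetched to the compute device, and the fetch cost is governed by exactly the factors the abstract lists (interconnect, server location, disk speed).

```python
# Minimal sketch of MoE inference with expert offloading. Class names,
# sizes, and the CPU-as-slow-tier stand-in are illustrative assumptions,
# not the MedMixtral 8x7B implementation described in this record.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffloadedMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        # Expert weights stay on the slow tier (CPU here, standing in
        # for local disk or a remote offloading server in the IoMT).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, compute_device: str = "cpu") -> torch.Tensor:
        # The router selects top-k experts per token, so only a few
        # experts' weights need to be resident on the compute device at
        # any time; this sparsity is what makes deployment on
        # memory-constrained IoMT hardware feasible.
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():
            # In a real offloading runtime this is the expensive step:
            # fetching the expert from disk or a nearby server. Its cost
            # depends on the interconnect, the server's location, and
            # disk speed, i.e., the latency factors the abstract reports.
            expert = self.experts[e].to(compute_device)
            hit = (idx == e)                      # (tokens, top_k) bool
            mask = hit.any(dim=-1)                # tokens routed to expert e
            gate = (weights * hit).sum(dim=-1, keepdim=True)
            out[mask] += gate[mask] * expert(x[mask])
        return out

tokens = torch.randn(5, 64)       # 5 token embeddings, d_model = 64
layer = OffloadedMoELayer()
print(layer(tokens).shape)        # torch.Size([5, 64])
```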
Main Authors: Yuan, Xiaoming; Kong, Weixuan; Luo, Zhenyu; Xu, Minrui
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2024
Subjects: Computer and Information Science; Large language models; Efficient inference offloading
Online Access: https://hdl.handle.net/10356/179743
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-179743
Type: Journal Article (published version)
Citation: Yuan, X., Kong, W., Luo, Z. & Xu, M. (2024). Efficient inference offloading for mixture-of-experts large language models in internet of medical things. Electronics, 13(11), 2077. https://dx.doi.org/10.3390/electronics13112077
ISSN: 2079-9292
DOI: 10.3390/electronics13112077
Scopus ID: 2-s2.0-85195785333
Collection: DR-NTU (NTU Library, Nanyang Technological University)
Funding: This research was supported in part by the National Natural Science Foundation of China (62371116), in part by the Science and Technology Project of Hebei Province Education Department (ZD2022164), and in part by the Project of Hebei Key Laboratory of Software Engineering (22567637H).
Rights: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).