Efficient inference offloading for mixture-of-experts large language models in internet of medical things
Despite recent significant advancements in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). People are increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs struggle both to provide accurate medical question answering (Q&A) and to meet the resource demands of deployment in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, which enables deployment on the IoMT and improves privacy protection for users. Additionally, we find that the significant factors affecting latency include the method of device interconnection, the location of the offloading servers, and the speed of the disk.
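The record carries no implementation details beyond the abstract, but the offloading idea it describes can be sketched. The following minimal Python/PyTorch example is an illustrative assumption, not the authors' MedMixtral 8x7B code: class names, sizes, and the use of CPU as a stand-in for the slow storage tier are all invented here. It shows why MoE inference pairs naturally with offloading: the router activates only a few experts per token, so only those experts' weights must be fetched to the compute device, and the fetch cost is governed by exactly the factors the abstract lists (interconnect, server location, disk speed).

```python
# Minimal sketch of MoE inference with expert offloading. Class names,
# sizes, and the CPU-as-slow-tier stand-in are illustrative assumptions,
# not the MedMixtral 8x7B implementation described in this record.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffloadedMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        # Expert weights stay on the slow tier (CPU here, standing in
        # for local disk or a remote offloading server in the IoMT).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, compute_device: str = "cpu") -> torch.Tensor:
        # The router selects top-k experts per token, so only a few
        # experts' weights need to be resident on the compute device at
        # any time; this sparsity is what makes deployment on
        # memory-constrained IoMT hardware feasible.
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():
            # In a real offloading runtime this is the expensive step:
            # fetching the expert from disk or a nearby server. Its cost
            # depends on the interconnect, the server's location, and
            # disk speed, i.e., the latency factors the abstract reports.
            expert = self.experts[e].to(compute_device)
            hit = (idx == e)                      # (tokens, top_k) bool
            mask = hit.any(dim=-1)                # tokens routed to expert e
            gate = (weights * hit).sum(dim=-1, keepdim=True)
            out[mask] += gate[mask] * expert(x[mask])
        return out

tokens = torch.randn(5, 64)       # 5 token embeddings, d_model = 64
layer = OffloadedMoELayer()
print(layer(tokens).shape)        # torch.Size([5, 64])
```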
Main Authors: Yuan, Xiaoming; Kong, Weixuan; Luo, Zhenyu; Xu, Minrui
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2024
Subjects: Computer and Information Science; Large language models; Efficient inference offloading
Online Access: https://hdl.handle.net/10356/179743
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-179743
Type: Journal Article (published version)
Citation: Yuan, X., Kong, W., Luo, Z. & Xu, M. (2024). Efficient inference offloading for mixture-of-experts large language models in internet of medical things. Electronics, 13(11), 2077. https://dx.doi.org/10.3390/electronics13112077
ISSN: 2079-9292
DOI: 10.3390/electronics13112077
Scopus ID: 2-s2.0-85195785333
Collection: DR-NTU (NTU Library, Nanyang Technological University)
Funding: This research was supported in part by the National Natural Science Foundation of China (62371116), in part by the Science and Technology Project of Hebei Province Education Department (ZD2022164), and in part by the Project of Hebei Key Laboratory of Software Engineering (22567637H).
Rights: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).