NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model
Main Authors: Lai, Yuzhi; Yuan, Shenghai; Nassar, Youssef; Fan, Mingyu; Weber, Thomas; Rätsch, Matthias
Other Authors: School of Electrical and Electronic Engineering; Centre for Advanced Robotics Technology Innovation (CARTIN)
Format: Article
Language: English
Published: 2025
Subjects: Computer and Information Science; Human-robot interaction; Intent recognition; Multi-modality perception; Large language models; Unsupervised interaction
Online Access: https://hdl.handle.net/10356/182119
Institution: Nanyang Technological University
Journal: Expert Systems with Applications, vol. 268, article 126360 (2025)
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.126360
Citation: Lai, Y., Yuan, S., Nassar, Y., Fan, M., Weber, T. & Rätsch, M. (2025). NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model. Expert Systems With Applications, 268, 126360. https://dx.doi.org/10.1016/j.eswa.2024.126360
Version: Submitted/Accepted version
Funding: This work is supported by a grant of the EFRE and MWK ProFöR&D program, no. FEIH_ProT_2517820 and MWK32-7535-30/10/2. This work is also supported by the National Research Foundation (NRF), Singapore, under its Medium-Sized Center for Advanced Robotics Technology Innovation.
Rights: © 2025 Elsevier. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1016/j.eswa.2024.126360.
Description:
Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward well-trained objects, creating a gap when dealing with new ones. Current HRI systems that rely on predefined gestures or language tokens for pretrained objects pose challenges for all users, especially elderly ones: difficulty recalling commands, memorizing hand gestures, and learning new names.
This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge.
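As a rough illustration of this step, here is a minimal sketch of deictic, zero-shot object selection, assuming the `segment_anything` Python package and a calibrated RGB-D camera. The pointing-ray helper and its inputs (wrist and fingertip positions from a skeleton tracker) are hypothetical stand-ins for the paper's posture module, not its actual interface:

```python
# Minimal sketch: cast the user's pointing ray into the depth image and
# prompt SAM with the pixel it hits. Helper names are illustrative.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def pointed_pixel(wrist, fingertip, depth, K):
    """March along the wrist->fingertip ray (camera frame, metres) and
    return the first pixel where the ray passes behind the depth surface."""
    ray = (fingertip - wrist) / np.linalg.norm(fingertip - wrist)
    for t in np.linspace(0.2, 3.0, 300):
        p = wrist + t * ray
        if p[2] <= 0:                       # behind the camera, skip
            continue
        u, v, w = K @ p                     # pinhole projection
        u, v = int(u / w), int(v / w)
        if (0 <= v < depth.shape[0] and 0 <= u < depth.shape[1]
                and 0.0 < depth[v, u] <= p[2]):
            return u, v                     # ray has met the surface
    return None

def segment_pointed_object(rgb, depth, wrist, fingertip, K):
    """Zero-shot SAM mask for whatever object the user points at."""
    hit = pointed_pixel(wrist, fingertip, depth, K)
    if hit is None:
        return None
    predictor.set_image(rgb)                # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([hit], dtype=np.float32),
        point_labels=np.array([1]),         # 1 = foreground point prompt
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]    # best-scoring candidate mask
```

Because SAM is prompted with a single foreground point rather than a class label, the object needs no prior training data, which is what enables the zero-shot behavior described above.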
NVP-HRI also integrates a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time to produce collision-free trajectories. We further regulate the action sequence with essential control syntax to reduce the risk of LLM hallucination. Evaluation on diverse real-world tasks with a Universal Robot demonstrated an efficiency improvement of up to 59.2% over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
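One plausible way to enforce such a control syntax is to constrain the LLM's output to a closed grammar and reject anything outside it before execution. The verb set, grammar, and `llm_complete` call below are illustrative assumptions, not the paper's actual syntax:

```python
# Sketch: regulate LLM output with a closed control syntax so
# hallucinated steps can be rejected before they reach the robot.
import re

GRAMMAR = re.compile(r"^(PICK|PLACE|MOVE_TO|RELEASE)\(obj_\d+(,\s*obj_\d+)?\)$")

PROMPT = (
    "Translate the user's spoken command into robot actions. "
    "Respond ONLY with lines of the form VERB(obj_i[, obj_j]), where "
    "VERB is one of PICK, PLACE, MOVE_TO, RELEASE and every obj_i is "
    "the id of an object segmented in the current scene.\n\nCommand: "
)

def parse_plan(reply: str, scene_ids: set[str]) -> list[str]:
    """Accept only steps that match the grammar and reference objects
    actually present in the scene; anything else is rejected as a
    hallucination."""
    plan = []
    for line in reply.strip().splitlines():
        step = line.strip()
        if not GRAMMAR.match(step):
            raise ValueError(f"step outside control syntax: {step!r}")
        for obj in re.findall(r"obj_\d+", step):
            if obj not in scene_ids:
                raise ValueError(f"step names unseen object: {obj}")
        plan.append(step)
    return plan

# Usage (llm_complete is a placeholder for any chat-completion call):
#   plan = parse_plan(llm_complete(PROMPT + "put the cup on the tray"),
#                     scene_ids={"obj_1", "obj_2"})
```

A step that fails validation never reaches the motion planner; the command can simply be re-queried, which is the essence of using a restricted syntax to bound hallucination risk.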