NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model

Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward only well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems using predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones. These challenges include difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge. NVP-HRI also integrates with a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time for collision-free trajectory solutions. We also regulate the action sequence with the essential control syntax to reduce LLM hallucination risks. The evaluation of diverse real-world tasks using a Universal Robot showcased up to 59.2% efficiency improvement over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.


Bibliographic Details
Main Authors: Lai, Yuzhi, Yuan, Shenghai, Nassar, Youssef, Fan, Mingyu, Weber, Thomas, Rätsch, Matthias
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2025
Subjects:
Online Access:https://hdl.handle.net/10356/182119
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-182119
Subjects: Computer and Information Science; Human-robot interaction; Intent recognition; Multi-modality perception; Large language models; Unsupervised interaction
Affiliations: School of Electrical and Electronic Engineering; Centre for Advanced Robotics Technology Innovation (CARTIN)
Citation: Lai, Y., Yuan, S., Nassar, Y., Fan, M., Weber, T. & Rätsch, M. (2025). NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model. Expert Systems With Applications, 268, 126360. https://dx.doi.org/10.1016/j.eswa.2024.126360
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.126360
Version: Submitted/Accepted version
Funding: This work is supported by a grant of the EFRE and MWK ProFöR&D program, no. FEIH_ProT_2517820 and MWK32-7535-30/10/2. This work is also supported by the National Research Foundation (NRF), Singapore, under its Medium-Sized Center for Advanced Robotics Technology Innovation.
Rights: © 2025 Elsevier. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1016/j.eswa.2024.126360.