NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model
Main Authors: Lai, Yuzhi; Yuan, Shenghai; Nassar, Youssef; Fan, Mingyu; Weber, Thomas; Rätsch, Matthias
Other Authors: School of Electrical and Electronic Engineering; Centre for Advanced Robotics Technology Innovation (CARTIN)
Format: Article
Language: English
Published: 2025
Subjects: Computer and Information Science; Human-robot interaction; Intent recognition; Multi-modality perception; Large language models; Unsupervised interaction
Online Access: https://hdl.handle.net/10356/182119
Institution: Nanyang Technological University
Journal: Expert Systems with Applications, vol. 268, article 126360 (2025)
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.126360
Citation: Lai, Y., Yuan, S., Nassar, Y., Fan, M., Weber, T. & Rätsch, M. (2025). NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model. Expert Systems With Applications, 268, 126360. https://dx.doi.org/10.1016/j.eswa.2024.126360
Version: Submitted/Accepted version
Funding: This work is supported by a grant of the EFRE and MWK ProFöR&D program, no. FEIH_ProT_2517820 and MWK32-7535-30/10/2. This work is also supported by the National Research Foundation (NRF), Singapore, under its Medium-Sized Center for Advanced Robotics Technology Innovation.
Rights: © 2025 Elsevier. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1016/j.eswa.2024.126360.
Description:
Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward well-trained objects, creating a gap when dealing with new ones. Current HRI systems that rely on predefined gestures or language tokens for pretrained objects pose challenges for all users, especially elderly ones: difficulty recalling commands, memorizing hand gestures, and learning new names.
This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge.
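As a rough illustration of this step, here is a minimal sketch of deictic, zero-shot object selection, assuming the `segment_anything` Python package and a calibrated RGB-D camera. The pointing-ray helper and its inputs (wrist and fingertip positions from a skeleton tracker) are hypothetical stand-ins for the paper's posture module, not its actual interface:

```python
# Minimal sketch: cast the user's pointing ray into the depth image and
# prompt SAM with the pixel it hits. Helper names are illustrative.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def pointed_pixel(wrist, fingertip, depth, K):
    """March along the wrist->fingertip ray (camera frame, metres) and
    return the first pixel where the ray passes behind the depth surface."""
    ray = (fingertip - wrist) / np.linalg.norm(fingertip - wrist)
    for t in np.linspace(0.2, 3.0, 300):
        p = wrist + t * ray
        if p[2] <= 0:                       # behind the camera, skip
            continue
        u, v, w = K @ p                     # pinhole projection
        u, v = int(u / w), int(v / w)
        if (0 <= v < depth.shape[0] and 0 <= u < depth.shape[1]
                and 0.0 < depth[v, u] <= p[2]):
            return u, v                     # ray has met the surface
    return None

def segment_pointed_object(rgb, depth, wrist, fingertip, K):
    """Zero-shot SAM mask for whatever object the user points at."""
    hit = pointed_pixel(wrist, fingertip, depth, K)
    if hit is None:
        return None
    predictor.set_image(rgb)                # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([hit], dtype=np.float32),
        point_labels=np.array([1]),         # 1 = foreground point prompt
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]    # best-scoring candidate mask
```

Because SAM is prompted with a single foreground point rather than a class label, the object needs no prior training data, which is what enables the zero-shot behavior described above.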
NVP-HRI also integrates a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time to produce collision-free trajectories. We further regulate the action sequence with essential control syntax to reduce the risk of LLM hallucination. Evaluation on diverse real-world tasks with a Universal Robot demonstrated an efficiency improvement of up to 59.2% over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
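One plausible way to enforce such a control syntax is to constrain the LLM's output to a closed grammar and reject anything outside it before execution. The verb set, grammar, and `llm_complete` call below are illustrative assumptions, not the paper's actual syntax:

```python
# Sketch: regulate LLM output with a closed control syntax so
# hallucinated steps can be rejected before they reach the robot.
import re

GRAMMAR = re.compile(r"^(PICK|PLACE|MOVE_TO|RELEASE)\(obj_\d+(,\s*obj_\d+)?\)$")

PROMPT = (
    "Translate the user's spoken command into robot actions. "
    "Respond ONLY with lines of the form VERB(obj_i[, obj_j]), where "
    "VERB is one of PICK, PLACE, MOVE_TO, RELEASE and every obj_i is "
    "the id of an object segmented in the current scene.\n\nCommand: "
)

def parse_plan(reply: str, scene_ids: set[str]) -> list[str]:
    """Accept only steps that match the grammar and reference objects
    actually present in the scene; anything else is rejected as a
    hallucination."""
    plan = []
    for line in reply.strip().splitlines():
        step = line.strip()
        if not GRAMMAR.match(step):
            raise ValueError(f"step outside control syntax: {step!r}")
        for obj in re.findall(r"obj_\d+", step):
            if obj not in scene_ids:
                raise ValueError(f"step names unseen object: {obj}")
        plan.append(step)
    return plan

# Usage (llm_complete is a placeholder for any chat-completion call):
#   plan = parse_plan(llm_complete(PROMPT + "put the cup on the tray"),
#                     scene_ids={"obj_1", "obj_2"})
```

A step that fails validation never reaches the motion planner; the command can simply be re-queried, which is the essence of using a restricted syntax to bound hallucination risk.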