NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model

Bibliographic Details
Main Authors: Lai, Yuzhi, Yuan, Shenghai, Nassar, Youssef, Fan, Mingyu, Weber, Thomas, Rätsch, Matthias
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2025
Subjects:
Online Access:https://hdl.handle.net/10356/182119
Institution: Nanyang Technological University
Description
Summary: Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems that use predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones: difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multimodal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural representation of objects. Through the pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge of them. NVP-HRI also integrates a large language model (LLM) for multimodal commands, coordinating them with object selection and the scene distribution in real time to produce collision-free trajectories. We also regulate the action sequence with an essential control syntax to reduce the risk of LLM hallucination. Evaluation on diverse real-world tasks with a Universal Robot showed up to a 59.2% efficiency improvement over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
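
To make the deictic-selection step concrete, here is a minimal sketch of how a pointing gesture, once reduced to a 2D pixel in the camera frame, can prompt a pre-trained SAM network for a zero-shot mask of the indicated object, as the abstract describes. The checkpoint filename, the gesture-to-pixel step, and the function name are illustrative assumptions, not the authors' released code; fusing the mask with depth data for 3D structure is not shown.

```python
# Sketch: zero-shot segmentation of a pointed-at object via SAM's point prompt.
# Assumptions: point_xy comes from an upstream posture-estimation step, and the
# checkpoint path is a placeholder for a downloaded SAM weight file.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

def segment_pointed_object(image_bgr, point_xy, checkpoint="sam_vit_h_4b8939.pth"):
    """Return the highest-scoring SAM mask for the pixel the user points at."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point_xy], dtype=np.float32),
        point_labels=np.array([1]),   # 1 marks the prompt as foreground
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]  # boolean HxW mask of the new object
```

Because SAM is prompted with a point rather than a class label, no object-specific training is needed, which is what lets the system handle objects it has never seen.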
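The abstract also mentions regulating the LLM's action sequence with an essential control syntax to reduce hallucination risk. The sketch below illustrates one plausible reading of that idea: accept the LLM's reply only if every line matches a small whitelist of action primitives. The primitive set, argument format, and function names are hypothetical; the paper's actual grammar is not reproduced here.

```python
# Sketch: validating LLM output against a fixed control syntax so that only
# whitelisted action primitives ever reach the robot controller.
import re

# One permitted pattern per primitive; obj_/pose_ ids are symbolic handles
# resolved elsewhere (e.g., from the SAM-selected object and scene layout).
CONTROL_SYNTAX = [
    re.compile(r"^PICK\(obj_\d+\)$"),
    re.compile(r"^PLACE\(obj_\d+,\s*pose_\d+\)$"),
    re.compile(r"^MOVE_TO\(pose_\d+\)$"),
]

def parse_action_sequence(llm_text):
    """Parse an LLM reply into validated actions; raise on any illegal line."""
    actions = []
    for line in filter(None, (l.strip() for l in llm_text.splitlines())):
        if not any(pattern.match(line) for pattern in CONTROL_SYNTAX):
            raise ValueError(f"hallucinated or malformed action: {line!r}")
        actions.append(line)
    return actions

# A compliant reply passes; a line with an invented primitive would raise.
print(parse_action_sequence("MOVE_TO(pose_1)\nPICK(obj_3)\nPLACE(obj_3, pose_2)"))
```

Constraining the output space this way means a hallucinated command fails validation before any trajectory is planned, rather than producing unsafe motion.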