Contextual human object interaction understanding from pre-trained large language model

Bibliographic Details
Main Authors:	Gao, Jianjun, Yap, Kim-Hui, Wu, Kejun, Phan, Duc Tri, Garg, Kratika, Han, Boon Siew
Other Authors:	School of Electrical and Electronic Engineering
Format:	Conference or Workshop Item
Language:	English
Published:	2025
Subjects:	Computer and Information Science; Human object interaction; Zero-shot learning
Online Access:	https://hdl.handle.net/10356/182095
Institution:	Nanyang Technological University

Description
Summary:	Existing human object interaction (HOI) detection methods have introduced zero-shot learning techniques to recognize unseen interactions, but they still have limitations in understanding context information and comprehensive reasoning. To overcome these limitations, we propose a novel HOI learning framework, ContextHOI, which serves as an effective contextual HOI detector to enhance contextual understanding and zero-shot reasoning ability. The main contributions of the proposed ContextHOI are a novel context-mining decoder and a powerful interaction reasoning large language model (LLM). The context-mining decoder aims to extract linguistic contextual information from a pre-trained vision-language model. Based on the extracted context information, the proposed interaction reasoning LLM further enhances the zero-shot reasoning ability by leveraging rich linguistic knowledge. Extensive evaluation demonstrates that our proposed framework outperforms existing zero-shot methods on the HICO-DET and SWIG-HOI datasets, achieving up to 19.34% mAP on unseen interactions.

Similar Items