Contextual human object interaction understanding from pre-trained large language model

Existing human object interaction (HOI) detection methods have introduced zero-shot learning techniques to recognize unseen interactions, but they still have limitations in understanding context information and comprehensive reasoning. To overcome these limitations, we propose a novel HOI learning f...

Bibliographic Details
Main Authors:	Gao, Jianjun, Yap, Kim-Hui, Wu, Kejun, Phan, Duc Tri, Garg, Kratika, Han, Boon Siew
Other Authors:	School of Electrical and Electronic Engineering
Format:	Conference or Workshop Item
Language:	English
Published:	2025
Subjects:	Computer and Information Science Human object interaction Zero-shot learning
Online Access:	https://hdl.handle.net/10356/182095
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-182095
record_format	dspace
spelling	sg-ntu-dr.10356-182095 2025-01-10T15:42:28Z Contextual human object interaction understanding from pre-trained large language model Gao, Jianjun Yap, Kim-Hui Wu, Kejun Phan, Duc Tri Garg, Kratika Han, Boon Siew School of Electrical and Electronic Engineering 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Schaeffler Hub for Advanced REsearch (SHARE) Lab Computer and Information Science Human object interaction Zero-shot learning Existing human object interaction (HOI) detection methods have introduced zero-shot learning techniques to recognize unseen interactions, but they still have limitations in understanding context information and comprehensive reasoning. To overcome these limitations, we propose a novel HOI learning framework, ContextHOI, which serves as an effective contextual HOI detector to enhance contextual understanding and zero-shot reasoning ability. The main contributions of the proposed ContextHOI are a novel context-mining decoder and a powerful interaction reasoning large language model (LLM). The context-mining decoder aims to extract linguistic contextual information from a pre-trained vision-language model. Based on the extracted context information, the proposed interaction reasoning LLM further enhances the zero-shot reasoning ability by leveraging rich linguistic knowledge. Extensive evaluation demonstrates that our proposed framework outperforms existing zero-shot methods on the HICO-DET and SWIG-HOI datasets, achieving up to 19.34% mAP on unseen interactions. Agency for Science, Technology and Research (A*STAR) Submitted/Accepted version This research is supported by the Agency for Science, Technology and Research (A*STAR) under its IAF-ICP Programme I2001E0067 and the Schaeffler Hub for Advanced Research at NTU. 2025-01-09T06:29:48Z 2025-01-09T06:29:48Z 2024 Conference Paper Gao, J., Yap, K., Wu, K., Phan, D. T., Garg, K. & Han, B. S. (2024). Contextual human object interaction understanding from pre-trained large language model. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 13436-13440. https://dx.doi.org/10.1109/ICASSP48485.2024.10447511 9798350344851 https://hdl.handle.net/10356/182095 10.1109/ICASSP48485.2024.10447511 2-s2.0-85195374190 13436 13440 en I2001E0067 © 2024 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1109/ICASSP48485.2024.10447511. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Computer and Information Science Human object interaction Zero-shot learning
description	Existing human object interaction (HOI) detection methods have introduced zero-shot learning techniques to recognize unseen interactions, but they still have limitations in understanding context information and comprehensive reasoning. To overcome these limitations, we propose a novel HOI learning framework, ContextHOI, which serves as an effective contextual HOI detector to enhance contextual understanding and zero-shot reasoning ability. The main contributions of the proposed ContextHOI are a novel context-mining decoder and a powerful interaction reasoning large language model (LLM). The context-mining decoder aims to extract linguistic contextual information from a pre-trained vision-language model. Based on the extracted context information, the proposed interaction reasoning LLM further enhances the zero-shot reasoning ability by leveraging rich linguistic knowledge. Extensive evaluation demonstrates that our proposed framework outperforms existing zero-shot methods on the HICO-DET and SWIG-HOI datasets, achieving up to 19.34% mAP on unseen interactions.
author2	School of Electrical and Electronic Engineering
author_facet	School of Electrical and Electronic Engineering Gao, Jianjun Yap, Kim-Hui Wu, Kejun Phan, Duc Tri Garg, Kratika Han, Boon Siew
format	Conference or Workshop Item
author	Gao, Jianjun Yap, Kim-Hui Wu, Kejun Phan, Duc Tri Garg, Kratika Han, Boon Siew
author_sort	Gao, Jianjun
title	Contextual human object interaction understanding from pre-trained large language model
publishDate	2025
url	https://hdl.handle.net/10356/182095
