VrdONE : One-stage video visual relation detection

Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales.
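The one-stage formulation described in the abstract (enhance subject and object features with a synergy step, fuse them, then jointly predict a relation category and a per-frame binary mask, with no proposal or post-processing stage) can be pictured with a small PyTorch-style sketch. This is an illustrative approximation only, not the authors' released implementation: the module names (SOSBlock, VrdONESketch), attention-based synergy, feature dimensions, and heads are assumptions.

# Minimal sketch, assuming per-frame subject/object features are already extracted.
import torch
import torch.nn as nn

class SOSBlock(nn.Module):
    """Rough stand-in for the Subject-Object Synergy idea: let subject and
    object features attend to each other before they are combined."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.s2o = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.o2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, subj: torch.Tensor, obj: torch.Tensor):
        # subj, obj: (batch, frames, dim) per-frame entity features
        subj_enh, _ = self.s2o(subj, obj, obj)   # subject queries the object
        obj_enh, _ = self.o2s(obj, subj, subj)   # object queries the subject
        return subj + subj_enh, obj + obj_enh

class VrdONESketch(nn.Module):
    """One-stage head: fused pair features -> relation-class logits and a
    per-frame binary mask, predicted jointly (1D instance segmentation)."""
    def __init__(self, dim: int, num_predicates: int):
        super().__init__()
        self.sos = SOSBlock(dim)
        self.fuse = nn.Linear(2 * dim, dim)
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.cls_head = nn.Linear(dim, num_predicates)  # relation categories
        self.mask_head = nn.Linear(dim, 1)              # temporal binary mask

    def forward(self, subj: torch.Tensor, obj: torch.Tensor):
        subj, obj = self.sos(subj, obj)
        pair = self.fuse(torch.cat([subj, obj], dim=-1))          # (B, T, D)
        pair = self.temporal(pair.transpose(1, 2)).transpose(1, 2)
        logits = self.cls_head(pair.mean(dim=1))                  # (B, P)
        mask = self.mask_head(pair).squeeze(-1).sigmoid()         # (B, T)
        return logits, mask

if __name__ == "__main__":
    model = VrdONESketch(dim=128, num_predicates=50)
    subj = torch.randn(2, 32, 128)   # 2 subject-object pairs, 32 frames
    obj = torch.randn(2, 32, 128)
    logits, mask = model(subj, obj)
    print(logits.shape, mask.shape)  # torch.Size([2, 50]) torch.Size([2, 32])

The point of the sketch is the joint output: the mask marks which frames the relation spans while the logits name the predicate, so temporal localization and classification come out of one forward pass rather than two separate stages.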


Bibliographic Details
Main Authors: JIANG, Xinjie; ZHENG, Chenxi; XU, Xuemiao; LIU, Bangzhen; ZHENG, Weiying; ZHANG, Huaidong; HE, Shengfeng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024-10-28
Subjects: Scene understanding; Video relation detection; Video understanding; One-stage; Set prediction; Spatiotemporally synergism; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/9802
DOI: 10.1145/3664647.3680833
Collection: Research Collection School Of Computing and Information Systems (InK@SMU, SMU Libraries)
Institution: Singapore Management University