Constructing holistic spatio-temporal scene graph for video semantic role labeling

As one of the core video semantic understanding tasks, Video Semantic Role Labeling (VidSRL) aims to detect the salient events from given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL,...

Bibliographic Details
Main Authors: ZHAO, Yu, FEI, Hao, CAO, Yixin, LI, Bobo, ZHANG, Meishan, WEI, Jianguo, ZHANG, Min, CHUA, Tat-Seng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/8290
https://ink.library.smu.edu.sg/context/sis_research/article/9293/viewcontent/2308.05081.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-9293
record_format dspace
spelling sg-smu-ink.sis_research-9293 2023-12-20T03:02:37Z Constructing holistic spatio-temporal scene graph for video semantic role labeling ZHAO, Yu FEI, Hao CAO, Yixin LI, Bobo ZHANG, Meishan WEI, Jianguo ZHANG, Min CHUA, Tat-Seng As one of the core video semantic understanding tasks, Video Semantic Role Labeling (VidSRL) aims to detect the salient events from given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, they are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. Towards this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which models well both the fine-grained spatial semantics and temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, e.g., filtering noisy branches and building new informative connections, such that the overall structure representation can best coincide with the end-task demand. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses are presented for a better understanding of the advances of our methods. Our HostSG representation shows great potential to facilitate a broader range of other video understanding tasks. 
2023-11-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8290 info:doi/10.1145/3581783.3612096 https://ink.library.smu.edu.sg/context/sis_research/article/9293/viewcontent/2308.05081.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University video understanding semantic role labeling event extraction scene graph Graphics and Human Computer Interfaces Numerical Analysis and Scientific Computing
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic video understanding
semantic role labeling
event extraction
scene graph
Graphics and Human Computer Interfaces
Numerical Analysis and Scientific Computing
spellingShingle video understanding
semantic role labeling
event extraction
scene graph
Graphics and Human Computer Interfaces
Numerical Analysis and Scientific Computing
ZHAO, Yu
FEI, Hao
CAO, Yixin
LI, Bobo
ZHANG, Meishan
WEI, Jianguo
ZHANG, Min
CHUA, Tat-Seng
Constructing holistic spatio-temporal scene graph for video semantic role labeling
description As one of the core video semantic understanding tasks, Video Semantic Role Labeling (VidSRL) aims to detect the salient events from given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, they are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. Towards this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which models well both the fine-grained spatial semantics and temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, e.g., filtering noisy branches and building new informative connections, such that the overall structure representation can best coincide with the end-task demand. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses are presented for a better understanding of the advances of our methods. Our HostSG representation shows great potential to facilitate a broader range of other video understanding tasks.
format text
author ZHAO, Yu
FEI, Hao
CAO, Yixin
LI, Bobo
ZHANG, Meishan
WEI, Jianguo
ZHANG, Min
CHUA, Tat-Seng
author_facet ZHAO, Yu
FEI, Hao
CAO, Yixin
LI, Bobo
ZHANG, Meishan
WEI, Jianguo
ZHANG, Min
CHUA, Tat-Seng
author_sort ZHAO, Yu
title Constructing holistic spatio-temporal scene graph for video semantic role labeling
title_short Constructing holistic spatio-temporal scene graph for video semantic role labeling
title_full Constructing holistic spatio-temporal scene graph for video semantic role labeling
title_fullStr Constructing holistic spatio-temporal scene graph for video semantic role labeling
title_full_unstemmed Constructing holistic spatio-temporal scene graph for video semantic role labeling
title_sort constructing holistic spatio-temporal scene graph for video semantic role labeling
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8290
https://ink.library.smu.edu.sg/context/sis_research/article/9293/viewcontent/2308.05081.pdf
_version_ 1787136836407132160