Triadic temporal-semantic alignment for weakly-supervised video moment retrieval

Video Moment Retrieval (VMR) aims to identify specific event moments within untrimmed videos based on natural language queries. Existing VMR methods have been criticized for relying heavily on moment annotation bias rather than true multi-modal alignment reasoning. Weakly supervised VMR approaches inherently overcome this issue by training without precise temporal location information. However, they struggle with fine-grained semantic alignment and often yield multiple speculative predictions with prolonged video spans. In this paper, we take a step forward in the context of weakly supervised VMR by proposing a triadic temporal-semantic alignment model. Our proposed approach augments weak supervision by comprehensively addressing the multi-modal semantic alignment between query sentences and videos from both fine-grained and coarse-grained perspectives. To capture fine-grained cross-modal semantic correlations, we introduce a concept-aspect alignment strategy that leverages nouns to select relevant video clips. Additionally, an action-aspect alignment strategy with verbs is employed to capture temporal information. Furthermore, we propose an event-aspect alignment strategy that focuses on event information within coarse-grained video clips, thus mitigating the tendency towards long video span predictions during coarse-grained cross-modal semantic alignment. Extensive experiments conducted on the Charades-CD and ActivityNet-CD datasets demonstrate the superior performance of our proposed method.
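The concept-aspect alignment described in the abstract — using the query's nouns to pick out relevant video clips — can be illustrated with a minimal toy sketch. All embeddings, feature vectors, names, and the similarity threshold below are invented for illustration; this is not the authors' model, only a demonstration of the general idea of noun-to-clip similarity matching.

```python
# Toy sketch of noun-based clip selection for cross-modal alignment.
# Vectors and threshold are made up for the example; a real system
# would use learned text and video encoders.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings for nouns extracted from a query.
noun_embeddings = {
    "person": [0.9, 0.1, 0.0],
    "door":   [0.1, 0.9, 0.1],
}

# Hypothetical per-clip visual features for a 4-clip video.
clip_features = [
    [0.8, 0.2, 0.1],   # clip 0: person visible
    [0.1, 0.1, 0.9],   # clip 1: unrelated content
    [0.5, 0.8, 0.1],   # clip 2: person near door
    [0.0, 0.2, 0.9],   # clip 3: unrelated content
]

def select_clips(nouns, clips, threshold=0.7):
    """Keep clip indices whose best noun similarity clears the threshold."""
    selected = []
    for idx, feat in enumerate(clips):
        best = max(cosine(noun_embeddings[n], feat) for n in nouns)
        if best >= threshold:
            selected.append(idx)
    return selected

print(select_clips(["person", "door"], clip_features))  # → [0, 2]
```

Only the clips whose features resemble some query noun survive, which is the fine-grained filtering role the abstract assigns to the concept-aspect strategy.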

Bibliographic Details
Main Authors: LIU, Jin, XIE, JiaLong, ZHOU, Fengyu, HE, Shengfeng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Weakly supervised learning; Video moment retrieval; Temporal-semantic alignment; Graphics and Human Computer Interfaces; Software Engineering
Online Access:https://ink.library.smu.edu.sg/sis_research/9286
https://ink.library.smu.edu.sg/context/sis_research/article/10286/viewcontent/ssrn_4726553.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10286
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU; Research Collection School Of Computing and Information Systems
language English
topic Weakly supervised learning; Video moment retrieval; Temporal-semantic alignment; Graphics and Human Computer Interfaces; Software Engineering
author LIU, Jin; XIE, JiaLong; ZHOU, Fengyu; HE, Shengfeng
title Triadic temporal-semantic alignment for weakly-supervised video moment retrieval
description Video Moment Retrieval (VMR) aims to identify specific event moments within untrimmed videos based on natural language queries. Existing VMR methods have been criticized for relying heavily on moment annotation bias rather than true multi-modal alignment reasoning. Weakly supervised VMR approaches inherently overcome this issue by training without precise temporal location information. However, they struggle with fine-grained semantic alignment and often yield multiple speculative predictions with prolonged video spans. In this paper, we take a step forward in the context of weakly supervised VMR by proposing a triadic temporal-semantic alignment model. Our proposed approach augments weak supervision by comprehensively addressing the multi-modal semantic alignment between query sentences and videos from both fine-grained and coarse-grained perspectives. To capture fine-grained cross-modal semantic correlations, we introduce a concept-aspect alignment strategy that leverages nouns to select relevant video clips. Additionally, an action-aspect alignment strategy with verbs is employed to capture temporal information. Furthermore, we propose an event-aspect alignment strategy that focuses on event information within coarse-grained video clips, thus mitigating the tendency towards long video span predictions during coarse-grained cross-modal semantic alignment. Extensive experiments conducted on the Charades-CD and ActivityNet-CD datasets demonstrate the superior performance of our proposed method.
format text
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
doi 10.1016/j.patcog.2024.110819
license http://creativecommons.org/licenses/by-nc-nd/4.0/
url https://ink.library.smu.edu.sg/sis_research/9286
https://ink.library.smu.edu.sg/context/sis_research/article/10286/viewcontent/ssrn_4726553.pdf
_version_ 1814047873083375616