CONE: An efficient COarse-to-fiNE alignment framework for long video temporal grounding

Bibliographic Details
Main Authors:	HOU, Zhijian; ZHONG, Wanjun; JI, Lei; GAO, Difei; YAN, Kun; CHAN, Wing-Kwong; NGO, Chong-Wah; SHOU, Mike Z.; DUAN, Nan
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	Benchmarking; Computational linguistics; Natural language processing systems; Artificial Intelligence and Robotics
Online Access:	https://ink.library.smu.edu.sg/sis_research/8375 https://ink.library.smu.edu.sg/context/sis_research/article/9378/viewcontent/2023.acl_long.445.pdf
Institution:	Singapore Management University

Description
Summary:	This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Long videos are in high demand but less explored than short videos, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework that sits on top of existing VTG models and handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while maintaining SOTA results. Code has been released at https://github.com/houzhijian/CONE.
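
Illustrative sketch (not taken from the paper or its released code): the summary describes slicing a long video into sliding windows and keeping only the windows most relevant to the query before finer grounding. The Python below shows one simple way that idea could look. The function names, the mean-pooled window feature, and the cosine-similarity score are all assumptions made for illustration; CONE's actual scoring and training are described in the paper.

```python
import math


def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame-index pairs covering the video; end is exclusive."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window so the tail of the video is always covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(vectors):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_windows(query_vec, frame_feats, window, stride, top_k):
    """Hypothetical coarse stage: keep the top_k windows most similar to the query.

    Assumption: a window is scored by the cosine similarity between the query
    vector and the mean of its frame features.
    """
    scored = []
    for start, end in sliding_windows(len(frame_feats), window, stride):
        score = cosine(query_vec, mean_pool(frame_feats[start:end]))
        scored.append((score, start, end))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(start, end) for _, start, end in scored[:top_k]]
```

Only the selected windows would then be passed to a base VTG model for fine-grained moment localization, which is where the inference savings in such a scheme would come from.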

