CONE: An efficient COarse-to-fiNE alignment framework for long video temporal grounding

This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but far less explored, raising new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while retaining state-of-the-art results. Code has been released at https://github.com/houzhijian/CONE.
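
As a rough illustration of the two-stage idea in the abstract (coarse query-guided window selection, then fine-grained localization within the selected windows), the sketch below mocks the pipeline with cosine-similarity scoring over mean-pooled window features. All function names, window sizes, and the per-frame fine stage are illustrative assumptions rather than CONE's actual implementation; see the released code at https://github.com/houzhijian/CONE for the authors' version.

    # Minimal sketch of a coarse-to-fine grounding pipeline, as described
    # in the abstract. Names and parameters are illustrative, not CONE's API.
    import numpy as np

    def sliding_windows(n_frames: int, size: int, stride: int):
        """Yield (start, end) frame indices of overlapping windows."""
        for start in range(0, max(n_frames - size, 0) + 1, stride):
            yield start, min(start + size, n_frames)

    def coarse_to_fine(video_feats, query_feat, size=64, stride=32, top_k=5):
        """video_feats: (T, D) frame features; query_feat: (D,) query embedding.
        Coarse stage: rank windows by query similarity, keep top_k (this is
        what lets inference skip most of a long video). Fine stage: localize
        a moment inside each kept window (placeholder for a real VTG model)."""
        windows = list(sliding_windows(len(video_feats), size, stride))
        # Coarse: cosine similarity between query and mean-pooled window features.
        q = query_feat / np.linalg.norm(query_feat)
        sims = []
        for s, e in windows:
            w = video_feats[s:e].mean(axis=0)
            sims.append(float(w @ q / (np.linalg.norm(w) + 1e-8)))
        kept = sorted(range(len(windows)), key=lambda i: -sims[i])[:top_k]
        # Fine: a crude per-frame argmax stands in for proposal generation;
        # window-local spans are offset back to the global timeline.
        results = []
        for i in kept:
            s, e = windows[i]
            frame_sims = video_feats[s:e] @ q
            c = int(frame_sims.argmax())
            results.append((s + max(c - 2, 0), s + min(c + 2, e - s), sims[i]))
        return sorted(results, key=lambda r: -r[2])  # (start, end, coarse score)

    # Toy usage with random features.
    rng = np.random.default_rng(0)
    feats, query = rng.normal(size=(300, 16)), rng.normal(size=16)
    print(coarse_to_fine(feats, query)[:2])

Only the top_k windows ever reach the (expensive) fine stage, which is the source of the inference speedups the abstract reports; in CONE itself the fine stage is a full VTG model and the alignment is trained with contrastive learning rather than computed from raw pooled features.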

Bibliographic Details
Main Authors: HOU, Zhijian; ZHONG, Wanjun; JI, Lei; GAO, Difei; YAN, Kun; CHAN, Wing-Kwong; NGO, Chong-Wah; SHOU, Mike Z.; DUAN, Nan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
DOI: 10.18653/v1/2023.acl-long.445
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems
Subjects: Benchmarking; Computational linguistics; Natural language processing systems; Artificial Intelligence and Robotics
Online Access:https://ink.library.smu.edu.sg/sis_research/8375
https://ink.library.smu.edu.sg/context/sis_research/article/9378/viewcontent/2023.acl_long.445.pdf