CONE: An efficient COarse-to-fiNE alignment framework for long video temporal grounding
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG): localizing video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but much less explored, and they bring new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping state-of-the-art results. Code has been released at https://github.com/houzhijian/CONE.
Main Authors: | HOU, Zhijian; ZHONG, Wanjun; JI, Lei; GAO, Difei; YAN, Kun; CHAN, Wing-Kwong; NGO, Chong-Wah; SHOU, Mike Z.; DUAN, Nan |
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
Subjects: | Benchmarking; Computational linguistics; Natural language processing systems; Artificial Intelligence and Robotics |
Online Access: | https://ink.library.smu.edu.sg/sis_research/8375 https://ink.library.smu.edu.sg/context/sis_research/article/9378/viewcontent/2023.acl_long.445.pdf |
Institution: | Singapore Management University |
id: | sg-smu-ink.sis_research-9378 |
record_format: | dspace |
DOI: | 10.18653/v1/2023.acl-long.445 |
Publication date: | 2023-07-01 |
License: | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Collection: | Research Collection School Of Computing and Information Systems |
Subjects: | Benchmarking; Computational linguistics; Natural language processing systems; Artificial Intelligence and Robotics |
Last updated: | 2023-12-12 |
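The query-guided window selection described in the abstract can be illustrated with a minimal sketch: split precomputed clip features into overlapping windows, score each window against the query embedding, and keep only the top-k windows for fine-grained grounding. All names, the window size/stride, and the mean-pooled dot-product scorer below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of CONE's coarse stage (query-guided window selection).
# Window size, stride, top-k, and the mean-pooled dot-product scorer are
# illustrative assumptions, not the paper's actual design.
import numpy as np

def split_into_windows(clip_feats, window_size=64, stride=32):
    """Slide a fixed-size window over the clip-feature sequence."""
    return [(start, clip_feats[start:start + window_size])
            for start in range(0, len(clip_feats) - window_size + 1, stride)]

def select_windows(query_feat, windows, top_k=2):
    """Coarse stage: rank windows by query-window similarity and keep the
    top-k; only these would be passed to the fine-grained grounding model."""
    scores = [float(np.dot(query_feat, feats.mean(axis=0)))
              for _, feats in windows]
    order = np.argsort(scores)[::-1][:top_k]
    return [windows[i] for i in order]

# Toy check: the relevant moment occupies clips 100-130 along feature dim 0.
clips = np.zeros((200, 8))
clips[100:130, 0] = 1.0
query = np.array([1.0] + [0.0] * 7)
selected = select_windows(query, split_into_windows(clips))
starts = sorted(start for start, _ in selected)
print(starts)  # [64, 96] — the two windows overlapping clips 100-130
```

Because only the selected windows are scored by the expensive fine-grained model, inference cost scales with top-k rather than with video length, which is the source of the speedups the abstract reports.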