CONE: An efficient COarse-to-fiNE alignment framework for long video temporal grounding
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG): localizing video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but much less explored, and they bring new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping state-of-the-art results. Code has been released at https://github.com/houzhijian/CONE.
Main Authors: | HOU, Zhijian; ZHONG, Wanjun; JI, Lei; GAO, Difei; YAN, Kun; CHAN, Wing-Kwong; NGO, Chong-Wah; SHOU, Mike Z.; DUAN, Nan |
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
Subjects: | Benchmarking; Computational linguistics; Natural language processing systems; Artificial Intelligence and Robotics |
Online Access: | https://ink.library.smu.edu.sg/sis_research/8375 https://ink.library.smu.edu.sg/context/sis_research/article/9378/viewcontent/2023.acl_long.445.pdf |
Institution: | Singapore Management University |
id: | sg-smu-ink.sis_research-9378 |
record_format: | dspace |
DOI: | 10.18653/v1/2023.acl-long.445 |
Publication date: | 2023-07-01 |
License: | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Collection: | Research Collection School Of Computing and Information Systems |
Subjects: | Benchmarking; Computational linguistics; Natural language processing systems; Artificial Intelligence and Robotics |
Last updated: | 2023-12-12 |
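The query-guided window selection described in the abstract can be illustrated with a minimal sketch: split precomputed clip features into overlapping windows, score each window against the query embedding, and keep only the top-k windows for fine-grained grounding. All names, the window size/stride, and the mean-pooled dot-product scorer below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of CONE's coarse stage (query-guided window selection).
# Window size, stride, top-k, and the mean-pooled dot-product scorer are
# illustrative assumptions, not the paper's actual design.
import numpy as np

def split_into_windows(clip_feats, window_size=64, stride=32):
    """Slide a fixed-size window over the clip-feature sequence."""
    return [(start, clip_feats[start:start + window_size])
            for start in range(0, len(clip_feats) - window_size + 1, stride)]

def select_windows(query_feat, windows, top_k=2):
    """Coarse stage: rank windows by query-window similarity and keep the
    top-k; only these would be passed to the fine-grained grounding model."""
    scores = [float(np.dot(query_feat, feats.mean(axis=0)))
              for _, feats in windows]
    order = np.argsort(scores)[::-1][:top_k]
    return [windows[i] for i in order]

# Toy check: the relevant moment occupies clips 100-130 along feature dim 0.
clips = np.zeros((200, 8))
clips[100:130, 0] = 1.0
query = np.array([1.0] + [0.0] * 7)
selected = select_windows(query, split_into_windows(clips))
starts = sorted(start for start, _ in selected)
print(starts)  # [64, 96] — the two windows overlapping clips 100-130
```

Because only the selected windows are scored by the expensive fine-grained model, inference cost scales with top-k rather than with video length, which is the source of the speedups the abstract reports.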