GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

Bibliographic Details
Main Authors: HOU, Zhijian, JI, Lei, GAO, Difei, ZHONG, Wanjun, YAN, Kun, NGO, Chong-Wah, CHAN, Wing-Kwong, DUAN, Nan, SHOU, Mike Zheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Online Access:https://ink.library.smu.edu.sg/sis_research/8416
https://ink.library.smu.edu.sg/context/sis_research/article/9419/viewcontent/2306.15255.pdf
Description
Summary: In this report, we present our champion solution for the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. To accurately ground a natural language query in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data. In addition, we introduce a novel grounding model, GroundNLQ, which employs a multi-modal multi-scale grounding module for effective video-text fusion and for grounding over temporal intervals of varying lengths, which is especially important for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, surpassing all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.
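The summary describes a multi-modal multi-scale grounding module that fuses video and text features and predicts moments over temporal intervals of varying lengths. The minimal PyTorch sketch below illustrates that general idea only: cross-attention fusion followed by a temporal feature pyramid. The module and parameter names (MultiScaleGroundingHead, dim, num_scales) are hypothetical and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn as nn

class MultiScaleGroundingHead(nn.Module):
    """Illustrative multi-modal multi-scale grounding head (hypothetical).

    Video clip features attend to query-token features (fusion), then a
    temporal feature pyramid is built with strided convolutions so that
    coarser levels cover longer intervals, as needed for long videos.
    """

    def __init__(self, dim=256, num_scales=3, num_heads=8):
        super().__init__()
        # Cross-attention: each video clip attends to the text tokens.
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Each strided conv halves the temporal resolution.
        self.downsample = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(num_scales - 1)
        )
        # Shared per-clip head: 1 foreground score + 2 boundary offsets.
        self.head = nn.Conv1d(dim, 3, kernel_size=1)

    def forward(self, video, text):
        # video: (B, T, dim) clip features; text: (B, L, dim) query tokens.
        fused, _ = self.fuse(query=video, key=text, value=text)
        x = (video + fused).transpose(1, 2)   # (B, dim, T) for Conv1d
        outputs = [self.head(x)]              # finest temporal scale
        for down in self.downsample:
            x = down(x)                       # halve T
            outputs.append(self.head(x))      # coarser scale
        # Each entry: (B, 3, T_s) -- score and start/end offsets per clip.
        return outputs

# Usage sketch with random features standing in for real extractors.
model = MultiScaleGroundingHead()
video = torch.randn(2, 64, 256)   # 64 clip features per video
text = torch.randn(2, 12, 256)    # 12 query-token features
for scale, out in enumerate(model(video, text)):
    print(scale, out.shape)       # (2, 3, 64), (2, 3, 32), (2, 3, 16)
```

Sharing one prediction head across pyramid levels keeps the sketch small; the coarser levels, where each feature spans a longer stretch of video, are what make grounding in long egocentric videos tractable.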