Towards temporal sentence grounding in videos

Bibliographic Details
Main Author: Zhang, Hao
Other Authors: Sun Aixin
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects: Artificial intelligence; Computer applications; Document and text processing; Image processing and computer vision
Online Access:https://hdl.handle.net/10356/163788
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-163788
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computer applications
Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
description Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve from an untrimmed video a temporal moment (i.e., a fraction of the video) that semantically corresponds to a language query. Connecting computer vision and natural language processing, TSGV has drawn significant attention from researchers in both communities. Successful retrieval of temporal moments enables machines to understand and organize multimodal information in a systematic manner. Unlike humans, who can quickly identify the temporal moment semantically related to a given language query using their inference ability and commonsense knowledge, machines lack such intelligence. The main challenge is that a machine must understand the semantics of both the video and the language query, and perform precise cross-modal reasoning between them. Because video and language are different modalities, recognizing and localizing temporal moments depends heavily on how well the machine understands the input contents and the interactions between them. In this thesis, we introduce several novel approaches that tackle the TSGV problem from new perspectives. First, we formulate TSGV as a span-based question answering (QA) task by treating the input video as a text passage. We then devise a video span localizing network (VSLNet) on top of a standard span-based QA framework, accounting for the differences between TSGV and span-based QA. The proposed method demonstrates that the span-based QA framework is a promising direction for TSGV, and it achieves superior performance on several benchmark datasets. Second, despite the promising performance of VSLNet, we observe that existing solutions, VSLNet included, perform well only on short videos and fail to generalize to long videos. To address this performance degradation, we extend VSLNet to VSLNet-L with a multi-scale split-and-concatenation strategy. VSLNet-L splits the untrimmed video into short clip segments, predicts which clip segment contains the target moment, and suppresses the importance of the other segments. Experimental results show that VSLNet-L effectively addresses the performance degradation on long videos. Third, when the evaluation metric becomes strict, the results of TSGV methods drop significantly; that is, the predicted moment boundaries do not fit the ground truth well. Building on VSLNet, we investigate a sequence matching approach that incorporates concepts from named entity recognition (NER) to remedy moment boundary prediction errors. We first analyze the relationship between TSGV and NER and reveal that moment boundary prediction in TSGV is a generalized entity boundary detection problem. This insight leads us to equip the model with an NER-style boundary detection module and to develop a more effective and efficient TSGV algorithm. Fourth, we analyze the annotation distributional bias in widely used TSGV datasets. The existence of such bias “hints” a model to capture the statistical regularities of moment annotations rather than to perform genuine cross-modal reasoning. To address this issue, we propose two debiasing strategies, data debiasing and model debiasing, on top of VSLNet to “force” a TSGV model to focus on cross-modal reasoning for precise moment retrieval. Experimental results show that both strategies improve model generalization and suppress the effects of the bias.
Finally, we study the video corpus moment retrieval (VCMR) task, which aims to retrieve a temporal moment from a collection of untrimmed and unsegmented videos. VCMR extends TSGV and is more practical, since it does not rely on the strong assumption that a video-query pair is given. For this task, we first study the characteristics of the two general frameworks for VCMR: one is highly efficient but has inferior retrieval performance, while the other retrieves more accurately but is less efficient. We then propose a retrieval and localization network with contrastive learning to reconcile the trade-off between the efficiency and accuracy of existing approaches. In summary, although TSGV has been established and investigated for years, this thesis contributes several key ideas that approach it from different perspectives, namely span-based QA and NER in NLP. In addition, we address the annotation distributional bias in TSGV datasets and extend the task to a more practical scenario. We also shed light on a few potential directions for future work.
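To make the span-based QA formulation above concrete, here is a minimal, illustrative sketch in PyTorch of the general idea, not the thesis's actual VSLNet implementation: a video is assumed to be represented as a sequence of n feature vectors, a ground-truth moment given in seconds is mapped to start/end feature indices, and two lightweight heads score each position as the span start or end, analogous to answer-span prediction in reading comprehension. The names moment_to_span and SpanPredictor are hypothetical.

    import torch
    import torch.nn as nn

    def moment_to_span(start_sec, end_sec, duration_sec, num_features):
        # Map a ground-truth moment (in seconds) to start/end indices over the
        # video feature sequence; these indices serve as the "answer span" labels.
        start_idx = round(start_sec / duration_sec * (num_features - 1))
        end_idx = round(end_sec / duration_sec * (num_features - 1))
        return start_idx, end_idx

    class SpanPredictor(nn.Module):
        # Two linear heads score every video feature position as the span start
        # or end, mirroring answer-span prediction in span-based QA.
        def __init__(self, dim):
            super().__init__()
            self.start_head = nn.Linear(dim, 1)
            self.end_head = nn.Linear(dim, 1)

        def forward(self, fused):
            # fused: (batch, n, dim) video features already fused with the query
            start_logits = self.start_head(fused).squeeze(-1)  # (batch, n)
            end_logits = self.end_head(fused).squeeze(-1)      # (batch, n)
            return start_logits, end_logits

    # Example: a 60-second video with 128 features and a moment from 12.0s to 20.5s.
    labels = moment_to_span(12.0, 20.5, 60.0, 128)
    logits = SpanPredictor(dim=256)(torch.randn(1, 128, 256))

Training would minimize cross-entropy between the start/end logits and the mapped indices; at inference, the highest-scoring pair with start index no later than end index is mapped back to seconds to recover the predicted moment.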
author2 Sun Aixin
format Thesis-Doctor of Philosophy
author Zhang, Hao
author_sort Zhang, Hao
title Towards temporal sentence grounding in videos
publisher Nanyang Technological University
publishDate 2022
url https://hdl.handle.net/10356/163788
_version_ 1754611274843422720
spelling sg-ntu-dr.10356-163788 2023-01-03T05:05:24Z Towards temporal sentence grounding in videos Zhang, Hao Sun Aixin School of Computer Science and Engineering AXSun@ntu.edu.sg Doctor of Philosophy 2022-12-17T14:58:07Z 2022 Thesis-Doctor of Philosophy Zhang, H. (2022). Towards temporal sentence grounding in videos. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/163788 10.32657/10356/163788 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University