Towards semantic, debiased and moment video retrieval
Main Author: Satar, Burak
Supervisor: Lim Joo Hwee (joohwee_lim@ntu.edu.sg)
Affiliations: College of Computing and Data Science; Institute for Infocomm Research (I2R), A*STAR
Format: Thesis - Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2025
Subjects: Computer and Information Science; Video retrieval; Video corpus moment retrieval; Cross-modal retrieval; Long egocentric videos; Multimodal fusion with LLM and audio; Semantic and hierarchical video retrieval; Causal inference for temporal bias; Cross-modal representation learning
Online Access: https://hdl.handle.net/10356/182104
Citation: Satar, B. (2025). Towards semantic, debiased and moment video retrieval. Doctoral thesis, Nanyang Technological University, Singapore.
Licence: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Institution: Nanyang Technological University
Description:
Video retrieval aims to retrieve a whole video from a video corpus given a language query. One of the main challenges is that it requires establishing semantic correspondence between the two modalities. In addition, imbalanced datasets can introduce biases into retrieval models. Moreover, retrieving a moment from a video corpus that corresponds to a text query is even more challenging, especially for long egocentric videos, given the need for fine-grained cross-modal reasoning. In this thesis, we address these problems, working towards semantic, debiased, and moment video retrieval.
We first address a crucial task, semantic video retrieval, given the vast number of video clips uploaded daily. Most approaches learn a joint embedding space for plain textual and visual content without adequately exploiting their intra-modality structure and inter-modality correlations. We propose a novel transformer that explicitly disentangles text and video into the semantic roles of objects, spatial contexts and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at three hierarchical levels. The results indicate that our method outperforms state-of-the-art methods using the same visual backbone without pre-training. We also conduct extensive ablation studies to justify our design choices, and we further extend the method with additional design improvements, demonstrating its superiority in a competition setting.
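To make the hierarchical matching idea concrete, here is a minimal sketch of per-role scoring and fusion. It is our own illustration rather than the thesis implementation: the role-specific encoders that would produce the embeddings, the feature dimensions, and the equal-weight fusion are all assumptions.

```python
import torch
import torch.nn.functional as F

ROLES = ("objects", "spatial_context", "temporal_context")

def hierarchical_similarity(text_roles, video_roles):
    """Cosine similarity per semantic role, fused over the three levels.

    text_roles / video_roles: dicts mapping each role name to a
    (batch, dim) embedding produced by some role-specific encoder
    (assumed here, not shown). Returns a (batch_text, batch_video)
    score matrix.
    """
    sims = []
    for role in ROLES:
        t = F.normalize(text_roles[role], dim=-1)
        v = F.normalize(video_roles[role], dim=-1)
        sims.append(t @ v.t())               # per-role similarity matrix
    return torch.stack(sims, dim=0).mean(0)  # simple equal-weight fusion
```

In training, such a score matrix would typically feed a symmetric contrastive loss so that matching text-video pairs rank highest in both retrieval directions.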
However, even with these improvements, models can still be prone to various biases that lead them to learn spurious correlations.
Thus, we turn to a debiasing model that addresses a temporal bias specific to video retrieval. While many studies focus on improving pre-training or developing new backbones, existing methods may still suffer from biases at learning and inference time, as recent research in other text-video tasks suggests; for instance, temporal object co-occurrences in video scene graph generation can induce spurious correlations. We present the first systematic study of a temporal bias caused by the discrepancy in frame length between the training and test sets of trimmed video clips in text-video retrieval. We first hypothesise and verify the bias with a baseline study, then propose a causal debiasing approach and perform extensive experiments and ablation studies on three datasets. Our model surpasses the baseline by more than 2.5 points on nDCG, a semantic-relevance-focused evaluation metric, mitigating the bias.
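As one illustration of how a causal adjustment over a frame-length confounder might look, the sketch below stratifies the text-video score by frame-length bins and marginalises over a stratum prior. The binning, the prior estimates and the conditional score function are assumptions for exposition, not necessarily the causal formulation used in the thesis.

```python
def debiased_score(score_fn, query_emb, video_emb, length_bins, bin_priors):
    """Backdoor-style adjustment over a frame-length confounder.

    score_fn(query_emb, video_emb, length_bin) returns a score matrix
    conditioned on one frame-length stratum; bin_priors holds P(bin)
    estimated on the training set. The result approximates
    sum_b P(b) * score(query, video | b), i.e. a score that no longer
    depends on the frame-length distribution of the test clips.
    """
    total = None
    for b, prior in zip(length_bins, bin_priors):
        contrib = prior * score_fn(query_emb, video_emb, b)
        total = contrib if total is None else total + contrib
    return total
```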
However, as videos grow longer, users need to retrieve specific moments within them.
Therefore, we address the Video Corpus Moment Retrieval (VCMR) task, retrieving specific moments from an extensive video corpus. We are the first to tackle this challenge on long, fine-grained, untrimmed egocentric videos, whereas existing methods target short, coarse-grained, trimmed third-person videos. This is a formidable challenge: target moments lack textual information from speech or subtitles, and they are much shorter relative to the long video sequences, which are accompanied only by short narrations. Our approach captions moments over different timestamps using an off-the-shelf tool, enriches the captions with an LLM, and combines them with audio features from the corresponding long video to exploit the sound of object interactions. We establish three strong baselines that incorporate the same additional multimodal features for a fair comparison. Across two different architectural designs, we demonstrate a 10% to 86% increase in summed Recall over various IoU thresholds compared with the baseline methods.
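The following sketch shows one way such a moment-feature pipeline could be wired together. The helper functions (caption_window, enrich_with_llm, embed_text, embed_audio) and the fixed-window segmentation are hypothetical placeholders standing in for the off-the-shelf tools described above, not named thesis components.

```python
from dataclasses import dataclass

@dataclass
class MomentFeatures:
    start: float        # seconds into the long egocentric video
    end: float
    caption_emb: list   # embedding of the LLM-enriched caption
    audio_emb: list     # embedding of the audio in the same window

def build_moment_features(video_path, windows, caption_window, enrich_with_llm,
                          embed_text, embed_audio):
    """Caption, enrich and embed each candidate moment of one long video."""
    feats = []
    for start, end in windows:
        raw = caption_window(video_path, start, end)   # off-the-shelf captioner
        rich = enrich_with_llm(raw)                    # rewrite / expand caption
        feats.append(MomentFeatures(start, end,
                                    embed_text(rich),
                                    embed_audio(video_path, start, end)))
    return feats  # candidate moments later ranked against the text query
```

At retrieval time, these per-moment caption and audio embeddings would be fused with the visual features and scored against the query to localise the target moment within the corpus.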
In conclusion, this thesis contributes several key ideas from different perspectives, ranging from a novel transformer for semantic video retrieval to a causal inference method for debiased video retrieval. In addition, we leverage LLM and audio fusion to address moment retrieval in a video corpus of long egocentric videos. Finally, we outline future directions for improving the models' capabilities.