Towards semantic, debiased and moment video retrieval

Video retrieval aims to retrieve a whole video within a video corpus given a language query. However, one of the main challenges is that it requires reaching a semantic correlation between these modalities. Besides, imbalanced datasets can cause biases in the retrieval models. Moreover, retrieving a...

Bibliographic Details
Main Author: Satar, Burak
Other Authors: -
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2025
Subjects:
Online Access: https://hdl.handle.net/10356/182104
Institution: Nanyang Technological University
id sg-ntu-dr.10356-182104
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Video retrieval
Video corpus moment retrieval
Cross modal retrieval
Long egocentric videos
Multimodal fusion with LLM and audio
Semantic and hierarchical video retrieval
Causal inference for temporal bias
Cross-modal representation learning
description Video retrieval aims to retrieve a whole video from a video corpus given a language query. One of the main challenges is establishing a semantic correlation between the text and video modalities. In addition, imbalanced datasets can bias retrieval models. Retrieving a moment from a video corpus in response to a text query is even more challenging, especially for long egocentric videos, because it requires fine-grained cross-modal reasoning. In this thesis, we address these problems with the goal of semantic, debiased, and moment-level video retrieval.

We first address semantic video retrieval, a crucial task given the volume of video clips uploaded daily. Most approaches learn a joint embedding space for plain textual and visual content without adequately exploiting intra-modality structures and inter-modality correlations. We propose a novel transformer that explicitly disentangles text and video into three semantic roles (objects, spatial contexts, and temporal contexts) and uses an attention scheme to learn the intra- and inter-role correlations among them, discovering discriminative features for matching at three hierarchical levels. The results indicate that our method outperforms state-of-the-art methods given the same visual backbone without pre-training. We also conduct extensive ablation studies to justify our design choices. Finally, we extend our method with various design improvements and demonstrate its superiority in competition.

These improvements can still be prone to various biases that cause models to learn spurious correlations. We therefore focus on a debiasing model that addresses a temporal bias specific to video retrieval tasks. While many studies focus on improving pre-training or developing new backbones, existing methods may suffer from learning and inference bias, as recent research on other text-video tasks suggests. For instance, temporal object co-occurrences in video scene graph generation can induce spurious correlations. We present a systematic study of a temporal bias caused by the frame length discrepancy between the training and test sets of trimmed video clips, as the first attempt to address a temporal bias in text-video retrieval tasks. We first hypothesise the bias and verify it with a baseline study. We then propose a causal debiasing approach and perform extensive experiments and ablation studies on three datasets. Our model outperforms the baseline by more than 2.5 points on nDCG, a semantic-relevancy-focused evaluation metric, indicating that the bias is mitigated.

Longer videos, however, require users to retrieve specific moments within them. We therefore address the Video Corpus Moment Retrieval (VCMR) task, which retrieves specific moments from extensive video corpora. Our approach tackles this task for the first time on long, fine-grained, untrimmed egocentric videos, whereas existing methods target short, coarse-grained, trimmed third-person videos. This is a formidable challenge because the target moments lack textual information from speech or subtitles and are much shorter, sitting within longer video sequences accompanied by shorter narrations. Our approach captions moments at different timestamps using an off-the-shelf tool, enriches the captions with an LLM, and combines them with audio features from the corresponding long video to exploit the sound of object interactions. We establish three strong baselines that incorporate the same additional multimodal features for a fair comparison. Across two architectural designs, we demonstrate a 10% to 86% increase in the sum of the Recall metric over various IoU thresholds compared with the baseline methods.

In conclusion, this thesis contributes ideas from several perspectives, ranging from a novel transformer for semantic video retrieval to a causal inference method for debiased video retrieval. We also leverage LLM and audio fusion to address moment retrieval in a video corpus of long egocentric videos. Finally, we outline future work directions to improve the models' capabilities.
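The two evaluation metrics named in the description, nDCG and Recall summed over IoU thresholds, can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names, the (video_id, start, end) tuple layout, the choice of k values and IoU thresholds, and the assumption that graded relevance scores are already available for nDCG are all hypothetical choices made for illustration.

```python
import math

def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked, gts, k, iou_thr):
    """Fraction of queries with a correct moment among the top-k predictions.

    ranked: per query, a best-first list of (video_id, start, end) predictions.
    gts:    per query, one ground-truth (video_id, start, end) moment.
    A prediction counts as correct when it comes from the right video and
    its temporal IoU with the ground truth reaches iou_thr.
    """
    hits = 0
    for preds, gt in zip(ranked, gts):
        if any(p[0] == gt[0] and temporal_iou(p[1:], gt[1:]) >= iou_thr
               for p in preds[:k]):
            hits += 1
    return hits / len(gts)

def summed_recall(ranked, gts, ks=(1, 2), iou_thrs=(0.5, 0.7)):
    """Sum of Recall@k over every (k, IoU threshold) pair (assumed settings)."""
    return sum(recall_at_k(ranked, gts, k, t) for k in ks for t in iou_thrs)

def ndcg(relevances):
    """nDCG of a ranked list, given the graded relevance of each returned item."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example with two queries (all values are made up for illustration).
ranked = [[("v1", 0.0, 10.0), ("v2", 3.0, 8.0)], [("v3", 0.0, 4.0)]]
gts = [("v2", 2.0, 8.0), ("v3", 10.0, 12.0)]
total = summed_recall(ranked, gts)
```

In the toy example, the first query is only answered correctly at rank 2 and the second query is never answered correctly, so the recall values at k=1 are zero and the values at k=2 are 0.5 for both thresholds. Summing over thresholds rewards models that localise moments tightly rather than only roughly, which is one plausible reason to report the metric as a sum.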
author2 -
author_facet -
Satar, Burak
format Thesis-Doctor of Philosophy
author Satar, Burak
author_sort Satar, Burak
title Towards semantic, debiased and moment video retrieval
publisher Nanyang Technological University
publishDate 2025
url https://hdl.handle.net/10356/182104
_version_ 1821237189367824384
spelling sg-ntu-dr.10356-182104 2025-01-09T04:40:30Z Towards semantic, debiased and moment video retrieval Satar, Burak - College of Computing and Data Science Institute for Infocomm Research (I2R), A*STAR Lim Joo Hwee joohwee_lim@ntu.edu.sg Computer and Information Science Video retrieval Video corpus moment retrieval Cross modal retrieval Long egocentric videos Multimodal fusion with LLM and audio Semantic and hierarchical video retrieval Causal inference for temporal bias Cross-modal representation learning
Doctor of Philosophy 2025-01-09T04:40:29Z 2025-01-09T04:40:29Z 2025 Thesis-Doctor of Philosophy
Satar, B. (2025). Towards semantic, debiased and moment video retrieval. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/182104
https://hdl.handle.net/10356/182104 en A18A2b0046 This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University