Towards semantic, debiased and moment video retrieval
Main Author: Satar, Burak
Supervisor: Lim Joo Hwee (joohwee_lim@ntu.edu.sg)
Affiliations: College of Computing and Data Science; Institute for Infocomm Research (I2R), A*STAR
Format: Thesis - Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2025
Subjects: Computer and Information Science; Video retrieval; Video corpus moment retrieval; Cross-modal retrieval; Long egocentric videos; Multimodal fusion with LLM and audio; Semantic and hierarchical video retrieval; Causal inference for temporal bias; Cross-modal representation learning
Online Access: https://hdl.handle.net/10356/182104
Citation: Satar, B. (2025). Towards semantic, debiased and moment video retrieval. Doctoral thesis, Nanyang Technological University, Singapore.
Licence: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Institution: Nanyang Technological University
Description:
Video retrieval aims to retrieve a whole video from a video corpus given a language query. One of the main challenges is that it requires establishing semantic correspondence between the two modalities. In addition, imbalanced datasets can introduce biases into retrieval models. Moreover, retrieving a moment from a video corpus that corresponds to a text query is even more challenging, especially for long egocentric videos, given the need for fine-grained cross-modal reasoning. In this thesis, we address these problems, working towards semantic, debiased, and moment video retrieval.
We first address a crucial task, semantic video retrieval, given the vast number of video clips uploaded daily. Most approaches learn a joint embedding space for plain textual and visual content without adequately exploiting their intra-modality structure and inter-modality correlations. We propose a novel transformer that explicitly disentangles text and video into the semantic roles of objects, spatial contexts and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at three hierarchical levels. The results indicate that our method outperforms state-of-the-art methods using the same visual backbone without pre-training. We also conduct extensive ablation studies to justify our design choices, and we further extend the method with additional design improvements, demonstrating its superiority in a competition setting.
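To make the hierarchical matching idea concrete, here is a minimal sketch of per-role scoring and fusion. It is our own illustration rather than the thesis implementation: the role-specific encoders that would produce the embeddings, the feature dimensions, and the equal-weight fusion are all assumptions.

```python
import torch
import torch.nn.functional as F

ROLES = ("objects", "spatial_context", "temporal_context")

def hierarchical_similarity(text_roles, video_roles):
    """Cosine similarity per semantic role, fused over the three levels.

    text_roles / video_roles: dicts mapping each role name to a
    (batch, dim) embedding produced by some role-specific encoder
    (assumed here, not shown). Returns a (batch_text, batch_video)
    score matrix.
    """
    sims = []
    for role in ROLES:
        t = F.normalize(text_roles[role], dim=-1)
        v = F.normalize(video_roles[role], dim=-1)
        sims.append(t @ v.t())               # per-role similarity matrix
    return torch.stack(sims, dim=0).mean(0)  # simple equal-weight fusion
```

In training, such a score matrix would typically feed a symmetric contrastive loss so that matching text-video pairs rank highest in both retrieval directions.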
However, even with these improvements, models can still be prone to various biases that lead them to learn spurious correlations.
Thus, we turn to a debiasing model that addresses a temporal bias specific to video retrieval. While many studies focus on improving pre-training or developing new backbones, existing methods may still suffer from biases at learning and inference time, as recent research in other text-video tasks suggests; for instance, temporal object co-occurrences in video scene graph generation can induce spurious correlations. We present the first systematic study of a temporal bias caused by the discrepancy in frame length between the training and test sets of trimmed video clips in text-video retrieval. We first hypothesise and verify the bias with a baseline study, then propose a causal debiasing approach and perform extensive experiments and ablation studies on three datasets. Our model surpasses the baseline by more than 2.5 points on nDCG, a semantic-relevance-focused evaluation metric, mitigating the bias.
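As one illustration of how a causal adjustment over a frame-length confounder might look, the sketch below stratifies the text-video score by frame-length bins and marginalises over a stratum prior. The binning, the prior estimates and the conditional score function are assumptions for exposition, not necessarily the causal formulation used in the thesis.

```python
def debiased_score(score_fn, query_emb, video_emb, length_bins, bin_priors):
    """Backdoor-style adjustment over a frame-length confounder.

    score_fn(query_emb, video_emb, length_bin) returns a score matrix
    conditioned on one frame-length stratum; bin_priors holds P(bin)
    estimated on the training set. The result approximates
    sum_b P(b) * score(query, video | b), i.e. a score that no longer
    depends on the frame-length distribution of the test clips.
    """
    total = None
    for b, prior in zip(length_bins, bin_priors):
        contrib = prior * score_fn(query_emb, video_emb, b)
        total = contrib if total is None else total + contrib
    return total
```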
However, as videos grow longer, users need to retrieve specific moments within them.
Therefore, we address the Video Corpus Moment Retrieval (VCMR) task, retrieving specific moments from an extensive video corpus. We are the first to tackle this challenge on long, fine-grained, untrimmed egocentric videos, whereas existing methods target short, coarse-grained, trimmed third-person videos. This is a formidable challenge: target moments lack textual information from speech or subtitles, and they are much shorter relative to the long video sequences, which are accompanied only by short narrations. Our approach captions moments over different timestamps using an off-the-shelf tool, enriches the captions with an LLM, and combines them with audio features from the corresponding long video to exploit the sound of object interactions. We establish three strong baselines that incorporate the same additional multimodal features for a fair comparison. Across two different architectural designs, we demonstrate a 10% to 86% increase in summed Recall over various IoU thresholds compared with the baseline methods.
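The following sketch shows one way such a moment-feature pipeline could be wired together. The helper functions (caption_window, enrich_with_llm, embed_text, embed_audio) and the fixed-window segmentation are hypothetical placeholders standing in for the off-the-shelf tools described above, not named thesis components.

```python
from dataclasses import dataclass

@dataclass
class MomentFeatures:
    start: float        # seconds into the long egocentric video
    end: float
    caption_emb: list   # embedding of the LLM-enriched caption
    audio_emb: list     # embedding of the audio in the same window

def build_moment_features(video_path, windows, caption_window, enrich_with_llm,
                          embed_text, embed_audio):
    """Caption, enrich and embed each candidate moment of one long video."""
    feats = []
    for start, end in windows:
        raw = caption_window(video_path, start, end)   # off-the-shelf captioner
        rich = enrich_with_llm(raw)                    # rewrite / expand caption
        feats.append(MomentFeatures(start, end,
                                    embed_text(rich),
                                    embed_audio(video_path, start, end)))
    return feats  # candidate moments later ranked against the text query
```

At retrieval time, these per-moment caption and audio embeddings would be fused with the visual features and scored against the query to localise the target moment within the corpus.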
In conclusion, this thesis contributes several key ideas from different perspectives, ranging from a novel transformer for semantic video retrieval to a causal inference method for debiased video retrieval. In addition, we leverage LLM and audio fusion to address moment retrieval in a video corpus of long egocentric videos. Finally, we outline future directions for improving the models' capabilities.