Interactive known-item search in large video corpora

Bibliographic Details
Main Author: MA, Zhixin
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Online Access: https://ink.library.smu.edu.sg/etd_coll/670
https://ink.library.smu.edu.sg/context/etd_coll/article/1668/viewcontent/GPIS_AY2020_PhD_MaZhixin.pdf
Institution: Singapore Management University
Description
The surge in video volume makes it challenging to locate a specific target with a single query using automatic video retrieval systems. Interactive video retrieval offers a solution by enabling users to iteratively refine a search. Nevertheless, existing systems often present users with an overwhelming number of similar videos, which can cause mental fatigue during result inspection and make it harder to provide feedback. This dissertation studies known-item video search and addresses four key challenges.

First and foremost, as the link between users and the system, the interaction must be both efficient and effective. To ensure effectiveness, the user's workload should be minimized, as excessive effort can hinder their ability to thoroughly review the results. For instance, when browsing videos with similar content, users may overlook the desired target because the search process is so cumbersome. Efficiency, in this context, means reducing the number of iterations, as users are likely to abandon the task after a few unsuccessful attempts.

Second, bridging the information gap between user intention and feedback is not trivial. Since it is unrealistic to assume that user intentions can always be clearly conveyed, there is always an information gap between user intention and the feedback the system receives. Inconsistencies between user intention and system understanding pose significant challenges for subsequent analyses, including intention prediction, query-history analysis, and ranking optimization.

Third, resilience to noise and irrelevant search results is critically important, especially when narrowing the search space for more efficient search and browsing. When noise dominates the top-ranked search results, the target can easily be pruned during search-space refinement. A robust system should identify and filter out noise to prevent the search from being steered towards irrelevant results.

Lastly, recruiting human evaluators to interact with a search system for training-data collection is often impractical. Instead, designing a model to simulate user interactions is common practice. The user simulation must be feasible and must closely replicate human behavior. Since the interactive system is intended for human users, the simulator needs to learn how to respond appropriately to human feedback. As large-scale training data from real users is unavailable, simulators should be designed to mimic human behavior as accurately as possible.

In response to these challenges, this dissertation addresses the problems from the following aspects. First, to prevent users from being overwhelmed by massive amounts of similar content during video searches, we present a reinforcement learning (RL)-based framework. The framework enables users to provide keyword feedback, guiding the system to navigate the search space more effectively. By continuously learning from user feedback, the search system can iteratively plan a navigation path and recommend a potential target that maximizes long-term rewards. We evaluate this approach on the challenging task of video corpus moment retrieval (VCMR), which localizes specific moments in a large video corpus. Experimental results on the TVR and DiDeMo datasets demonstrate that our method is effective in retrieving moments hidden deep inside the ranked lists of CONQUER and HERO, two state-of-the-art VCMR auto-search engines.
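The abstract does not specify the framework's implementation, so the sketch below is only a minimal, hypothetical illustration of such a feedback-driven navigation loop: a fixed Rocchio-style state update stands in for the learned RL policy, and the names (`video_embs`, `rank_of`), the simulated feedback signal, and the step size are assumptions for illustration, not the dissertation's actual components.

```python
# Minimal sketch (not the dissertation's code): navigating an embedding
# space with simulated keyword feedback. A trained RL policy would choose
# the update direction and step size to maximize long-term rank
# improvement; a fixed Rocchio-style step is used here as a stand-in.
import numpy as np

rng = np.random.default_rng(0)
N_VIDEOS, DIM = 1000, 64

# Hypothetical unit-normalized video embeddings.
video_embs = rng.normal(size=(N_VIDEOS, DIM))
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)

def rank_of(scores, target):
    """1-based rank of the target under descending scores."""
    return int(np.where(np.argsort(-scores) == target)[0][0]) + 1

target = 42
state = video_embs[target] + 0.8 * rng.normal(size=DIM)  # noisy initial query
state /= np.linalg.norm(state)

for step in range(10):
    rank = rank_of(video_embs @ state, target)
    print(f"iteration {step}: target rank = {rank}")
    if rank == 1:          # target surfaced at the top of the list
        break
    # Simulated keyword feedback: a noisy direction toward the true target.
    feedback = video_embs[target] + 2.0 * rng.normal(size=DIM)
    feedback /= np.linalg.norm(feedback)
    # Policy step: move the navigation state toward the feedback signal.
    state += 0.5 * feedback
    state /= np.linalg.norm(state)
```

The detail the sketch preserves is that each round of feedback re-plans the navigation path through the search space rather than merely re-ranking a static result list.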
Second, we revisit PicHunter, an effective relevance feedback system designed for known-item search (KIS), which faces two primary challenges: the inconsistency of "relative judgments" and the need for an optimized display strategy. Inconsistent relative judgments provided by users severely degrade system performance, while a suboptimal display strategy hinders the goal of minimizing search iterations. To assess PicHunter's performance on large-scale video datasets with interpretable embedding-based representations, we test the system on the V3C dataset and measure its sensitivity to inconsistent user feedback. Additionally, we introduce pairwise relative-judgment feedback to reduce the user's decision-making workload and the computational complexity of the probability updates. Moreover, an RL agent is trained to optimize the display strategy, reduce interaction rounds, and enhance retrieval performance. Our empirical results on the V3C2 dataset, comprising 1,425,454 video shots, show that the proposed display model improves PicHunter's performance by 4.6% when users are assumed to provide consistently accurate relative judgments. When user and system perceptions are misaligned, system performance declines significantly due to this sensitivity, but the proposed strategy mitigates the sensitivity and prevents severe degradation.

Lastly, to address the misalignment between user and machine perceptions, we introduce a robust relevance feedback method for KIS that decomposes user perception into multiple sub-perceptions. The assumption is that while users may not consistently align with a single feature representation, they are more likely to agree with one or several spaces within a set of feature spaces. To support this, we propose a predictive user model that predicts the composition of user perception for each relative judgment. A confidence score for each sub-perception is estimated for each user judgment and integrated into the Bayesian update, increasing the system's tolerance for inconsistency in relevance feedback. Experimental results on the V3C2 dataset demonstrate that the proposed method identifies the target video with up to a 54.67% likelihood within a search depth of 10 to 5,000 ranks, based on model-aligned relevance feedback. Together, these studies contribute to an efficient, noise-resilient interactive video search system.
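Neither PicHunter's exact likelihood function nor the dissertation's predictive user model is given in the abstract, so the following is a hedged sketch of the general idea described above: a PicHunter-style Bayesian posterior update over candidate targets, in which per-feature-space likelihoods of a pairwise relative judgment are mixed by sub-perception confidence scores. All names, the sigmoid likelihood, and the example confidence values are illustrative assumptions.

```python
# Hedged sketch (not the dissertation's code): a PicHunter-style Bayesian
# update over candidate targets, extended to several feature spaces
# ("sub-perceptions") whose likelihoods are mixed by confidence scores.
import numpy as np

rng = np.random.default_rng(1)
N, K, DIM = 500, 3, 32            # videos, feature spaces, embedding size

# One hypothetical embedding table per sub-perception
# (e.g. colour, semantics, motion).
spaces = [rng.normal(size=(N, DIM)) for _ in range(K)]
posterior = np.full(N, 1.0 / N)   # P(target = i), initially uniform

def pair_likelihood(emb, a, b, sigma=1.0):
    """P(user judges shot a closer to the target than shot b | target = i)
    for every candidate i, via a sigmoid of the distance difference."""
    da = np.linalg.norm(emb - emb[a], axis=1)
    db = np.linalg.norm(emb - emb[b], axis=1)
    return 1.0 / (1.0 + np.exp((da - db) / sigma))

def bayes_update(posterior, a, b, confidences):
    """One feedback round ('a is more target-like than b'), mixing the
    per-space likelihoods by predicted sub-perception confidences."""
    mixed = sum(c * pair_likelihood(emb, a, b)
                for c, emb in zip(confidences, spaces))
    post = posterior * mixed
    return post / post.sum()

# Example round: the user prefers shot 7 over shot 99; a (hypothetical)
# predictive user model has output the per-space confidence scores.
posterior = bayes_update(posterior, a=7, b=99,
                         confidences=np.array([0.6, 0.3, 0.1]))
print("top-5 candidates:", np.argsort(-posterior)[:5])
```

Mixing the likelihoods rather than committing to a single feature space is what yields tolerance to inconsistent feedback: a judgment that contradicts one space can still be explained by another, so no single noisy round can collapse the posterior onto the wrong candidates.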