Domain adaptation for video action recognition

Bibliographic Details
Main Author: Wang, Xiyu
Other Authors: Mao, Kezhi
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/172273
Institution: Nanyang Technological University
Description
Summary: Humans can effortlessly learn from a specific data distribution and generalize well to new situations without excessive supervision. In contrast, deep learning models often struggle to achieve similar generalization, primarily because they are trained with algorithms that minimize empirical risk on the training data under the assumption that test data share the same distribution as the training data. In practice, significant domain shifts between the training (source) and testing (target) data can occur, causing deep models to generalize poorly on target domains and necessitating additional supervision for adaptation. To address this, Video-based Unsupervised Domain Adaptation (VUDA) has been proposed as a cost-efficient approach for transferring video action recognition models from the source domain to an unlabeled target domain. Nonetheless, VUDA relies on strong assumptions, such as identical label spaces and fixed target domains, which may not hold in real-world applications. This thesis therefore aims to remove these assumptions and broaden the applicability of video adaptation methods, focusing on two scenarios that conventional VUDA methods cannot handle: partial domain adaptation (adapting from a source domain with many classes to a target domain with fewer classes) and continual domain adaptation (adapting to continuously changing target domains).

For partial domain adaptation, this thesis proposes the Multi-modality Cluster-calibrated partial Adversarial Network (MCAN), which constructs a multi-modal network to extract robust features and introduces a novel calibration method to refine the estimation of the target class distribution, effectively filtering out irrelevant source classes. To address further real-world challenges in adapting deep video models, this thesis defines the problem of continuous video domain adaptation and proposes the Confidence-Attentive network with geneRalization enhanced self-knowledge disTillation (CART). This method leverages attentive learning and a novel generalization-enhanced self-knowledge distillation to preserve knowledge learned on previously seen target domains while adapting to newly encountered ones, ultimately providing a single model that performs well across all seen target domains at minimal cost.

The proposed partial and continuous video domain adaptation methods are evaluated on existing benchmarks and on new benchmarks constructed in this thesis. Our results demonstrate significant performance improvements for both MCAN and CART, with MCAN showing particularly strong gains when domain shifts are substantial and CART demonstrating a superior ability to preserve learned knowledge. In conclusion, the findings on partial and continuous domain adaptation broaden the applicability of video domain adaptation methods, making them more general and cost-efficient.
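
To illustrate the core idea behind partial domain adaptation described in the abstract, the following is a minimal sketch, not the thesis's MCAN: the target class distribution is estimated from the model's averaged softmax predictions, and source-only classes are down-weighted in the source loss so they do not mislead adaptation. All names (NUM_CLASSES, estimate_class_weights, the random tensors) are hypothetical placeholders rather than the thesis's implementation.

```python
# Illustrative sketch of class-weight estimation for partial domain adaptation
# (assumed, generic formulation; not the thesis's MCAN).
import torch
import torch.nn.functional as F

NUM_CLASSES = 12  # source label space; the target uses an unknown subset

def estimate_class_weights(target_logits: torch.Tensor) -> torch.Tensor:
    """Average softmax predictions over target samples to approximate the
    target class distribution, then normalise so the largest weight is 1."""
    probs = F.softmax(target_logits, dim=1)   # (N_target, NUM_CLASSES)
    class_dist = probs.mean(dim=0)            # estimated target class prior
    return class_dist / class_dist.max()      # weights in (0, 1]

def weighted_source_loss(source_logits, source_labels, class_weights):
    """Cross-entropy on source data, re-weighted per class so classes absent
    from the target domain contribute little to the gradient."""
    per_sample = F.cross_entropy(source_logits, source_labels, reduction="none")
    return (class_weights[source_labels] * per_sample).mean()

# Toy usage with random tensors standing in for real features and labels.
target_logits = torch.randn(64, NUM_CLASSES)
source_logits = torch.randn(32, NUM_CLASSES)
source_labels = torch.randint(0, NUM_CLASSES, (32,))
w = estimate_class_weights(target_logits)
loss = weighted_source_loss(source_logits, source_labels, w)
print(loss.item())
```

In a full adversarial adaptation pipeline, the same weights would typically also re-weight the domain-alignment loss, so that alignment focuses on classes the source and target actually share.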
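Likewise, the knowledge-preservation mechanism mentioned for continual adaptation can be sketched with a generic self-knowledge distillation loss: a frozen snapshot of the model, taken before adapting to a new target domain, supervises the current model and discourages forgetting. This is a minimal, assumed formulation rather than the thesis's CART; names such as snapshot, model, and TEMP are illustrative.

```python
# Illustrative sketch of self-knowledge distillation for continual adaptation
# (assumed, generic formulation; not the thesis's CART).
import copy
import torch
import torch.nn.functional as F

TEMP = 2.0  # softening temperature for distillation

model = torch.nn.Linear(128, 12)        # stand-in for a video recognition model
snapshot = copy.deepcopy(model).eval()  # frozen copy holding previous knowledge
for p in snapshot.parameters():
    p.requires_grad_(False)

def self_distillation_loss(x: torch.Tensor) -> torch.Tensor:
    """KL divergence between the frozen snapshot's softened predictions and the
    current model's, computed on unlabeled data from the new target domain."""
    with torch.no_grad():
        teacher = F.softmax(snapshot(x) / TEMP, dim=1)
    student = F.log_softmax(model(x) / TEMP, dim=1)
    return F.kl_div(student, teacher, reduction="batchmean") * (TEMP ** 2)

# Toy usage: this term would be added to whatever adaptation loss drives
# learning on the newly encountered target domain.
x = torch.randn(16, 128)
loss = self_distillation_loss(x)
loss.backward()
```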