Effective action recognition with fully supervised and self-supervised methods
Format: Thesis (Master by Research)
Language: English
Published: Nanyang Technological University, 2021
Online Access: https://hdl.handle.net/10356/152741
Institution: Nanyang Technological University
Summary: Action recognition in videos has attracted interest in the computer vision and machine learning communities thanks to applications such as surveillance and smart homes. In addition to the spatial information in individual frames, videos contain temporal information across frames. Therefore, effective spatio-temporal representation is the key to accurate action recognition in videos.
Previous works have proposed various fully-supervised and self-supervised methods for video representation learning. Most fully-supervised methods utilize convolutional neural networks (CNNs) to extract spatial representations, while temporal representations are usually modelled by pixel-wise correlations. However, extracting correlations between all pixels is inefficient, since some of them relate to non-salient areas (e.g., backgrounds or environments). On the other hand, self-supervised methods are proposed to leverage the more accessible unlabeled data on the Internet and transfer the learned representations to different downstream tasks. The core of self-supervised methods is to design a pretext task whose supervision signal is automatically generated from characteristics of the unlabeled data. Although self-supervised methods avoid costly manual annotation, there remains a performance gap between them and fully-supervised methods. In this thesis, we address these gaps with two novel deep learning methods that advance fully-supervised and self-supervised learning, respectively.
For fully-supervised learning, we propose a novel Key Point Shift Embedding Module (KPSEM) that adaptively extracts channel-wise key point shifts across video frames for temporal feature extraction, without requiring key point annotations. Key points are adaptively selected as the feature points with maximum feature values within split regions, and key point shifts are the spatial displacements of corresponding key points across frames. These shifts are encoded into the overall temporal features via linear embedding layers in a multi-set manner.
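To make the key point shift idea concrete, the following is a minimal sketch of how such a module could be realized. It is not the thesis's actual implementation: the module name, the number of split regions, the embedding dimension, and the use of a single shared linear layer are all illustrative assumptions; only the general recipe (channel-wise argmax key points within split regions, frame-to-frame displacements, linear embedding of the shifts) follows the description above.

```python
# Illustrative sketch only; assumes an input feature map of shape (B, T, C, H, W).
import torch
import torch.nn as nn

class KeyPointShiftEmbedding(nn.Module):
    def __init__(self, channels: int, num_regions: int = 4, embed_dim: int = 64):
        super().__init__()
        self.num_regions = num_regions          # split the spatial map into R x R regions
        # shared linear embedding over the (dx, dy) shifts of all channels;
        # the multi-set encoding in the thesis may use several such layers
        self.embed = nn.Linear(2 * channels, embed_dim)

    def _key_points(self, feat: torch.Tensor) -> torch.Tensor:
        """Channel-wise key points: argmax feature location within each region.
        feat: (B, C, H, W) -> normalized (x, y) coords of shape (B, R*R, C, 2)."""
        B, C, H, W = feat.shape
        R = self.num_regions
        regions = feat.view(B, C, R, H // R, R, W // R).permute(0, 2, 4, 1, 3, 5)
        regions = regions.reshape(B, R * R, C, -1)                 # (B, R*R, C, h*w)
        idx = regions.argmax(dim=-1)                                # flat argmax per channel/region
        ys = torch.div(idx, W // R, rounding_mode="floor").float() / (H // R)
        xs = (idx % (W // R)).float() / (W // R)
        return torch.stack([xs, ys], dim=-1)                        # (B, R*R, C, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, C, H, W) -> temporal embedding of shape (B, T-1, R*R, embed_dim)."""
        B, T, C, H, W = feats.shape
        kps = torch.stack([self._key_points(feats[:, t]) for t in range(T)], dim=1)
        shifts = kps[:, 1:] - kps[:, :-1]                           # displacements between frames
        shifts = shifts.flatten(start_dim=3)                        # (B, T-1, R*R, 2*C)
        return self.embed(shifts)                                   # linear embedding of the shifts
```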
To advance self-supervised learning, we propose a novel self-supervised method, called Video Incoherence Detection (VID), that leverages incoherence detection for spatio-temporal representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, each training sample, denoted as an incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video, with varying lengths of incoherence between them. Given an incoherent clip as input, the network learns high-level representations by predicting the relative location and length of the incoherence. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between different incoherent clips from the same raw video.
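The sketch below illustrates one way such incoherent clips could be constructed at the frame-index level. The sampling strategy, clip lengths, and number of sub-clips are assumptions for illustration and may differ from the thesis; the point is that the skipped spans define both the pretext labels (location and length of incoherence) automatically, with no manual annotation.

```python
# Illustrative sketch of incoherent-clip construction; all numeric defaults are assumed.
import random

def build_incoherent_clip(num_frames: int, clip_len: int = 16,
                          num_subclips: int = 2, max_gap: int = 8):
    """Sample `num_subclips` consecutive sub-clips from one video, separated by
    randomly skipped spans ("incoherence"). Returns frame indices plus the
    relative location and length of each incoherence as pretext labels."""
    sub_len = clip_len // num_subclips
    indices, gaps, locations = [], [], []
    start = random.randint(0, max(0, num_frames - clip_len - max_gap * (num_subclips - 1)))
    cursor = start
    for i in range(num_subclips):
        indices.extend(range(cursor, cursor + sub_len))        # one coherent sub-clip
        cursor += sub_len
        if i < num_subclips - 1:
            gap = random.randint(1, max_gap)                   # number of skipped frames
            gaps.append(gap)                                   # label: incoherence length
            locations.append(len(indices))                     # label: where the jump occurs
            cursor += gap                                      # skip frames -> incoherence
    return indices, locations, gaps

# Example: a 16-frame clip built from two 8-frame sub-clips of a 300-frame video,
# with one temporal jump whose position and length the network must predict.
frames, loc, gap = build_incoherent_clip(num_frames=300)
```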
Our experiments show that both KPSEM and VID achieve state-of-the-art performance on action recognition under fully-supervised and self-supervised learning, respectively. Thorough ablation studies are also conducted to validate the design of both proposed methods.