Learning with few labels for skeleton-based action recognition
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/173603 |
Institution: | Nanyang Technological University |
Summary:

Human Action Recognition, which involves discerning human actions, is vital for many real-world applications. Skeleton sequences, tracing human body joint trajectories, capture essential human motions, making them appropriate for action recognition. Compared to RGB videos or depth data, 3D skeleton data offers concise representations of human behaviors, proving robust against appearance variations, distractions, and viewpoint changes. This has led to increased interest in skeleton-based action recognition research.
With the advance of deep learning, deep neural networks (e.g., CNNs, RNNs, and GCNs) have been widely studied for modeling the spatio-temporal representations of skeleton action sequences in supervised settings. However, supervised learning methods typically require substantial labeled data for model training, and labeling and vetting massive amounts of real-world training data is difficult, expensive, and time-consuming. Learning effective feature representations from minimal annotations therefore becomes a critical necessity.
Thus, in this thesis, we explore efficient ways to address this problem. Specifically, we investigate weakly-supervised, self-supervised, and one-shot learning methods for skeleton-based action recognition with few labels.

Firstly, we introduce a unique collaborative learning network for simultaneous gesture recognition and 3D hand pose estimation that capitalizes on joint-aware features. Additionally, we propose a weakly-supervised learning scheme capable of leveraging hand pose (or gesture) annotations to learn powerful gesture recognition (or pose estimation) models.

Secondly, we present self-supervised action representation learning as a task of repainting 3D skeleton clouds. In this framework, each skeleton sequence is viewed as a skeleton cloud and processed with a point-cloud auto-encoder. We introduce an innovative colorization technique in which each point of the skeleton cloud is colored according to its temporal and spatial order in the sequence; these color labels act as self-supervision signals that greatly enhance the self-supervised learning of skeleton action representations (see the first sketch after this summary).

Lastly, we formulate one-shot skeleton action recognition as an optimal matching problem and design an effective network framework for it. We propose a multi-scale matching strategy that captures scale-wise skeleton semantic relevance at multiple spatial and temporal scales and, building on this, a novel cross-scale matching scheme that models the within-class variation of human actions in motion magnitude and motion pace (see the second sketch after this summary).
To validate the efficacy of the proposed approaches, we carried out comprehensive experiments across various datasets; the findings demonstrate notable improvements over existing methods.
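To make the skeleton cloud colorization idea concrete, here is a minimal sketch of assigning color labels by temporal and spatial order. The function name, the array shapes, and the mapping of orders onto the red and green channels are illustrative assumptions; the thesis's actual colorization scheme is not detailed in this summary.

```python
import numpy as np

def colorize_skeleton_cloud(sequence):
    """Color every joint point by its temporal and spatial order.

    sequence: (T, J, 3) array -- T frames, J joints, xyz coordinates.
    Returns (points, colors), each of shape (T*J, 3), colors in [0, 1].
    NOTE: illustrative sketch only -- demonstrates order-based
    self-supervision labels, not the thesis's exact color scheme.
    """
    T, J, _ = sequence.shape
    points = sequence.reshape(T * J, 3)

    # Temporal (frame) order drives the red channel,
    # spatial (joint) order drives the green channel.
    t_order = np.repeat(np.arange(T), J) / max(T - 1, 1)
    j_order = np.tile(np.arange(J), T) / max(J - 1, 1)
    colors = np.stack([t_order, j_order, np.zeros(T * J)], axis=1)
    return points, colors
```

A point-cloud auto-encoder can then be trained to repaint (regress) these colors from the raw point coordinates, so temporal and spatial order become free self-supervision signals.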
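Similarly, the sketch below illustrates one-shot recognition as optimal matching between frame-level features, with simple temporal pooling standing in for the multi-scale strategy. The cosine-similarity cost, SciPy's `linear_sum_assignment`, and the pooling scales are simplifying assumptions; the thesis's matching framework is a learned network rather than this fixed solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_score(query_feats, support_feats):
    """Optimal one-to-one matching between per-frame embeddings.

    query_feats, support_feats: (T, D) arrays of frame-level features.
    Returns the mean similarity of the optimally matched frame pairs.
    """
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    s = support_feats / (np.linalg.norm(support_feats, axis=1, keepdims=True) + 1e-8)
    sim = q @ s.T                           # cosine similarity of all frame pairs
    row, col = linear_sum_assignment(-sim)  # negate to maximize total similarity
    return float(sim[row, col].mean())

def multi_scale_matching_score(query_feats, support_feats, scales=(1, 2, 4)):
    """Average matching scores over temporally pooled sequences -- a crude
    stand-in for capturing semantic relevance at multiple temporal scales."""
    scores = []
    for k in scales:
        if min(len(query_feats), len(support_feats)) < k:
            continue  # skip scales coarser than the sequence itself
        q = query_feats[: len(query_feats) // k * k]
        q = q.reshape(-1, k, q.shape[1]).mean(axis=1)
        s = support_feats[: len(support_feats) // k * k]
        s = s.reshape(-1, k, s.shape[1]).mean(axis=1)
        scores.append(matching_score(q, s))
    return float(np.mean(scores))
```

In a one-shot setting, the query would then be assigned the class of the support example with the highest score.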