Activity recognition in depth videos

Bibliographic Details
Main Author:	Amir Shahroudy
Other Authors:	Ng Tian-Tsong
Format:	Theses and Dissertations
Language:	English
Published:	2016
Subjects:	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Online Access:	https://hdl.handle.net/10356/69072
Institution:	Nanyang Technological University

Description
Summary:	The introduction of depth sensors has had a major impact on research in visual recognition. By providing 3D information, these cameras enable view-invariant and robust representations of the observed scenes and human bodies. Detection and 3D localization of human body parts are done more accurately and more efficiently in depth maps than in their RGB counterparts. Even with the 3D structure of the body parts available, the articulated and complex nature of human actions makes action recognition difficult. One approach to handling this complexity is to divide it into the kinetics of individual body parts and analyze the actions based on per-part descriptors. As the first work in this thesis, we propose a joint sparse regression based learning method that uses structured sparsity to model each action as a combination of multimodal features from a sparse set of body parts. To represent the dynamics and appearance of parts, we employ a heterogeneous set of depth and skeleton based features. The structure of the multimodal multipart features is formulated into the learning framework via the proposed hierarchical mixed norm, which regularizes the structured features of each part and applies sparsity between parts, in favor of group feature selection. Our experimental results demonstrate the effectiveness of the proposed learning method: it outperforms other methods on all three tested datasets and saturates one of them by achieving perfect accuracy. In addition to depth based representations of human actions, commonly used 3D sensors also provide RGB videos. It is generally accepted that each of these two modalities has different strengths and limitations for the task of action recognition. Therefore, analysis of RGB+D videos can help us to better study the complementary properties of these two modalities and achieve higher levels of performance.
In the second work, we propose a new deep autoencoder-based correlation-independence factorization network to separate input multimodal signals into a hierarchy of extracted components. Further, based on the structure of the features, we propose a structured sparsity learning machine that uses mixed norms to apply regularization within components and group selection between them for better classification performance. Our experimental results show the effectiveness of our cross-modality feature analysis framework, which achieves state-of-the-art accuracies for action classification on four challenging benchmark datasets: we reduce the error rate by more than 40% on three datasets and saturate the benchmark on the other one. Recent approaches in depth-based human activity analysis have achieved outstanding performance and proved the effectiveness of 3D representations for action classification. However, currently available depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including a lack of training samples, distinct class labels, camera views, and variety of subjects. In the third work, we introduce a large-scale dataset for RGB+D human action recognition with more than 56 thousand video samples and 4 million frames, collected from 40 distinct subjects. Our dataset contains 60 different action classes, including daily actions, mutual actions, and medical conditions. In addition, we propose a new recurrent neural network structure that models the long-term temporal correlation of the features of each body part and uses them for better action classification. Experimental results show the advantages of applying deep learning methods over state-of-the-art hand-crafted features under the suggested cross-subject and cross-view evaluation criteria for our dataset.
The introduction of this large-scale dataset will enable the community to apply, develop, and adapt various data-hungry learning techniques for the task of depth-based and RGB+D human activity analysis.

Similar Items