Recognizing and predicting human actions with depth camera

Understanding human behavior from videos is an important task in the computer vision community and a significant sub-branch of video analysis. Human behavior analysis is widely applied in scenarios such as human-computer interaction, video surveillance, video retrieval and autonomous driving...


Bibliographic Details
Main Author: Weng, Junwu
Other Authors: Jiang Xudong
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/138384
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-138384
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
spellingShingle Engineering::Electrical and electronic engineering
Weng, Junwu
Recognizing and predicting human actions with depth camera
description Understanding human behavior from videos is an important task in the computer vision community and a significant sub-branch of video analysis. Human behavior analysis is widely applied in scenarios such as human-computer interaction, video surveillance, video retrieval and autonomous driving. Thanks to the development of commodity depth cameras, skeleton-based human behavior analysis has recently drawn considerable attention in the computer vision community. Skeleton action sequences are extracted from depth cameras by pose estimation algorithms or captured directly by motion capture devices. Compared with RGB-based action sequences, skeleton-based action instances are more compact and semantic; the limitation is that skeletal data provides little appearance or scene information. How to design suitable and stable models that understand human behavior, both body actions and hand gestures, from skeletal data is an interesting and challenging topic. Two tasks are central to understanding human behavior from action sequences: action recognition and action prediction. In this thesis, four different models are proposed to handle these tasks. Owing to the success of deep learning models in image recognition, most state-of-the-art methods use deep learning for skeleton-based action recognition. However, compared with images and videos, which are composed of millions or billions of pixels, a skeleton consists of only tens of joints and is thus of much lower complexity. For such lightweight data, non-parametric models like Naive-Bayes Nearest Neighbor (NBNN) may be more suitable than high-complexity deep learning models. In the first two works of this thesis, two robust NBNN-based models, ST-NBNN and ST-NBMIM, are proposed to characterize skeleton sequences.
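The NBNN decision rule that the first two models build on can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual ST-NBNN/ST-NBMIM formulation; the function and variable names are assumptions.

```python
import numpy as np

def nbnn_classify(query_descriptors, class_pools):
    """Naive-Bayes Nearest Neighbor: assign the query sequence to the class
    that minimizes the sum, over query descriptors, of the squared distance
    to each descriptor's nearest neighbor in that class's descriptor pool."""
    best_label, best_cost = None, float("inf")
    for label, pool in class_pools.items():
        # pairwise squared distances, shape (n_query, n_pool)
        d2 = ((query_descriptors[:, None, :] - pool[None, :, :]) ** 2).sum(-1)
        cost = d2.min(axis=1).sum()  # NN distance per descriptor, summed
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label
```

Because the class-conditional cost is a sum over individual descriptors, no training phase is needed beyond storing each class's descriptor pool, which is what makes the model attractive for lightweight skeleton data.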
Besides, to better understand skeleton-based actions, bilinear classifiers are adopted to identify both the key temporal stages and the key spatial joints for action classification. Although only a linear classifier is used, experiments on five benchmark datasets show that, by combining the strengths of non-parametric and parametric models, ST-NBNN and ST-NBMIM achieve performance competitive with state-of-the-art results obtained by sophisticated models such as deep networks. Moreover, by identifying the key skeleton joints and temporal stages of each action class, the two NBNN-based models capture the essential spatio-temporal patterns that play key roles in recognizing actions, which is not always achievable with end-to-end models. On large-scale skeleton data, however, the non-parametric models reach their limitation, and deep-learning-based models demonstrate superior performance. Meanwhile, human body movements exhibit spatial patterns among pose joints. It is thus important to identify those motion patterns and ignore non-informative joints by finding the key combinations of joints that matter for recognition. Although the discovery of key spatio-temporal patterns has been explored previously for skeleton-based action recognition, the temporal dynamics of key joint combinations are not well studied in the community. In the third work of this thesis, a CNN model is proposed that adaptively searches for the key pose joints of each action sequence. The work trains a deformable CNN to discover sample-specific key spatio-temporal patterns for action recognition. This deformable convolution better exploits contextual joints for action and gesture recognition and is more robust to noisy joints.
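The bilinear weighting idea above can be illustrated with a small sketch: a score s = u^T X v over a (temporal stages x joints) feature matrix, where large entries of u and v mark the stages and joints the classifier relies on. The matrix layout and all names are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def bilinear_score(stage_joint_features, stage_weights, joint_weights):
    """Bilinear form s = u^T X v: X holds one feature per (temporal stage,
    joint) cell; u weights temporal stages, v weights spatial joints."""
    return stage_weights @ stage_joint_features @ joint_weights

# illustrative example: the class attends mostly to stage 1 and joint 0
X = np.array([[0.0, 1.0],
              [3.0, 0.5]])       # 2 temporal stages x 2 joints
u = np.array([0.1, 0.9])         # temporal-stage weights
v = np.array([0.8, 0.2])         # joint weights
```

Inspecting the learned u and v directly is what gives the interpretability claimed above: the weights name the key stages and joints per action class.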
The proposed model is evaluated on three benchmark datasets, and the experimental results show the effectiveness of introducing temporal dynamics modeling of key joint combinations into skeleton-based action recognition. The goal of early action recognition is to predict the action label when a sequence is only partially observed. Existing methods treat early action recognition as a sequence of classification problems at different observation ratios of an action sequence. Since these models are trained to differentiate the positive category from all negative classes, the diverse information carried by different negative categories is ignored, which we believe can be exploited to improve recognition performance. In the last work of this thesis, a new direction, introducing category exclusion into early action recognition, is explored. Category exclusion is modeled as a mask operation on the classification probabilities output by a pre-trained early action recognition classifier. Specifically, policy-based reinforcement learning is used to train an agent that generates a series of binary masks to exclude interfering negative categories during action execution and hence improve recognition accuracy. The proposed method is evaluated on three benchmark recognition datasets; it consistently enhances recognition accuracy over all observation ratios on all three datasets, with especially significant improvements at the early stages. In summary, this thesis demonstrates the performance of the four proposed methods, ST-NBNN, ST-NBMIM, Deformable Pose Traversal Convolution, and the Category Exclusion Agent, on the tasks of action recognition and action prediction from skeleton-based sequences. These four models are extensively evaluated on well-known benchmark datasets, and the experimental results show their effectiveness on their corresponding tasks.
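The mask operation described for the Category Exclusion Agent can be sketched as below. The renormalization step and all names are assumptions for illustration; the reinforcement-learning agent that produces the binary mask is omitted.

```python
import numpy as np

def apply_exclusion_mask(class_probs, binary_mask):
    """Zero out the probabilities of excluded (interfering) negative
    categories and renormalize, so the prediction is taken only over
    the categories the agent has not masked out."""
    masked = class_probs * binary_mask
    return masked / masked.sum()

# early in the sequence the classifier wavers between classes 0 and 2;
# suppose the agent has learned to exclude class 2 at this observation ratio
probs = np.array([0.40, 0.15, 0.45])
mask = np.array([1.0, 1.0, 0.0])
```

Excluding a strongly interfering negative class flips the argmax from class 2 to class 0 in this toy case, which is the mechanism by which masking can raise early-stage accuracy.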
author2 Jiang Xudong
author_facet Jiang Xudong
Weng, Junwu
format Thesis-Doctor of Philosophy
author Weng, Junwu
author_sort Weng, Junwu
title Recognizing and predicting human actions with depth camera
title_short Recognizing and predicting human actions with depth camera
title_full Recognizing and predicting human actions with depth camera
title_fullStr Recognizing and predicting human actions with depth camera
title_full_unstemmed Recognizing and predicting human actions with depth camera
title_sort recognizing and predicting human actions with depth camera
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/138384
_version_ 1772826952466432000
spelling sg-ntu-dr.10356-138384 2023-07-04T17:23:22Z Recognizing and predicting human actions with depth camera Weng, Junwu Jiang Xudong School of Electrical and Electronic Engineering exdjiang@ntu.edu.sg Engineering::Electrical and electronic engineering Doctor of Philosophy 2020-05-05T07:22:30Z 2020-05-05T07:22:30Z 2020 Thesis-Doctor of Philosophy Weng, J. (2020). Recognizing and predicting human actions with depth camera. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/138384 10.32657/10356/138384 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University