Human pose estimation and action recognition based on monocular video inputs
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2020 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/136596 |
Institution: | Nanyang Technological University |
Summary: This thesis presents the research work for a PhD program focused on the main processes in human motion analysis from a single video camera: 1) 3D human pose estimation, 2) motion reconstruction, and 3) action recognition. Motion capture and analysis is an active research area in computer vision with a wide range of applications, but recovering human motion from monocular video remains challenging because of the lack of depth information. Per-frame pose estimation and tracking methods have difficulty recovering from tracking failures and suffer from motion jitter. To address these limitations, we propose to directly estimate a sequence of poses from a stack of consecutive frames. We use an example-based method with dense spatio-temporal features to find the best-matching poses and then interpolate between them to achieve smooth motion reconstruction. For the action recognition task, we use a learning-based method, specifically deep learning with Convolutional Neural Networks (CNNs), to learn effective spatio-temporal features that distinguish action classes in large video datasets. In an initial study, a number of experiments were conducted to evaluate the effectiveness of different configurations of our architecture, followed by an extended study on deeper models. Lastly, we developed a generalized architecture that fuses 1D, 2D and 3D convolution layers and can be adopted into existing CNN models while retaining the network's learning properties. Our empirical studies demonstrated the advantages of our architecture over the corresponding 3D CNN models: 1) a 16–30% improvement in prediction accuracy, 2) effective spatio-temporal learning, and 3) lower computational cost. The future goal of this project is to link all the main processes in this research work into a full-pipeline human motion analysis system that can be applied in real-life applications such as healthcare or sports analysis.
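The summary's idea of fusing 1D, 2D and 3D convolution layers over a video clip can be pictured with a small sketch. The block below is a hypothetical illustration only, assuming PyTorch and a clip tensor of shape (batch, channels, frames, height, width); the class name `FusedConvBlock`, the additive fusion, and all layer sizes are assumptions for illustration, not the architecture described in the thesis.

```python
# A minimal sketch, assuming PyTorch: parallel 3D (spatio-temporal), 2D (per-frame
# spatial) and 1D (per-pixel temporal) convolutions over one clip, fused by addition.
# This is an illustrative assumption, not the thesis's actual architecture.
import torch
import torch.nn as nn

class FusedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv1d = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y3 = self.conv3d(x)                    # joint spatio-temporal filtering
        # 2D branch: fold time into the batch dimension, convolve each frame spatially.
        y2 = self.conv2d(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y2 = y2.reshape(b, t, -1, h, w).permute(0, 2, 1, 3, 4)
        # 1D branch: fold space into the batch dimension, convolve along time.
        y1 = self.conv1d(x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t))
        y1 = y1.reshape(b, h, w, -1, t).permute(0, 3, 4, 1, 2)
        return y3 + y2 + y1                    # simple additive fusion of the branches

# Usage: a random 2-clip batch of 8-frame RGB video at 32x32 resolution.
clip = torch.randn(2, 3, 8, 32, 32)
print(FusedConvBlock(3, 16)(clip).shape)       # torch.Size([2, 16, 8, 32, 32])
```

How the branches are combined (addition here) and where such a block is placed inside an existing CNN are design choices left open by the summary; the sketch only shows one plausible way the three convolution types can operate on the same clip tensor.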