Human pose estimation and action recognition based on monocular video inputs


Bibliographic Details
Main Author: Leong, Mei Chee
Other Authors: Lee Yong Tsui
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/136596
Institution: Nanyang Technological University
Description
Summary: This thesis presents PhD research investigating the main processes in human motion analysis from a single video camera: 1) 3D human pose estimation, 2) motion reconstruction, and 3) action recognition. Motion capture and analysis is an active research area in computer vision with a wide range of applications, but recovering human motion from monocular video is challenging because depth information is absent. Single-frame pose estimation and tracking methods struggle to recover from tracking failures and are prone to motion jitter. To address these limitations, we propose to directly estimate a sequence of poses from a stack of consecutive frames. We exploit an example-based method with dense spatio-temporal features to find the best-matching poses, then interpolate between them to achieve smooth motion reconstruction. For the action recognition task, we exploit a learning-based method, specifically deep learning with Convolutional Neural Networks (CNNs), for effective learning of spatio-temporal features to identify different action classes in large video datasets. In an initial study, a number of experiments were conducted to evaluate the effectiveness of configurations in our architecture, followed by an extended study on deeper models. Lastly, we developed a generalized architecture that fuses 1D, 2D and 3D convolution layers and can be adopted into existing CNN models while retaining the network's learning properties. Our empirical studies demonstrated the advantages of our architecture over its corresponding 3D CNN models: 1) a 16–30% boost in prediction accuracy, 2) effective spatio-temporal learning, and 3) lower computational cost.
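The computational-cost advantage of mixing lower-dimensional convolutions can be illustrated with a simple parameter count. The sketch below is not the thesis's actual architecture; it assumes a hypothetical layer with 64 input and 64 output channels, and compares a full 3×3×3 3D convolution against one common factorization, a spatial 2D (1×3×3) convolution followed by a temporal 1D (3×1×1) convolution:

```python
def conv_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a bias-free convolution layer
    with kernel size (kt, kh, kw) over (time, height, width)."""
    return c_in * c_out * kt * kh * kw

c = 64  # assumed channel width, for illustration only

# Full 3D convolution: a single 3x3x3 kernel per channel pair.
full_3d = conv_params(c, c, 3, 3, 3)

# Factorized alternative: 2D spatial (1x3x3) then 1D temporal (3x1x1).
spatial_2d = conv_params(c, c, 1, 3, 3)
temporal_1d = conv_params(c, c, 3, 1, 1)
factorized = spatial_2d + temporal_1d

print(full_3d, factorized, round(factorized / full_3d, 2))
# → 110592 49152 0.44
```

Under these assumptions the factorized pair needs fewer than half the weights of the full 3D layer while still covering both the spatial and the temporal axes, which is the kind of saving a fused 1D/2D/3D design can exploit.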
The future goal of this project is to link all the main processes in this research into a full-pipeline human motion analysis system that can be applied in real-life settings such as healthcare and sports analysis.