Human pose estimation and action recognition based on monocular video inputs

This thesis presents PhD research investigating the main processes in human motion analysis from a single video camera: 1) 3D human pose estimation, 2) motion reconstruction, and 3) action recognition. Motion capture and analysis is an active...

Bibliographic Details
Main Author: Leong, Mei Chee
Other Authors: Lee Yong Tsui
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
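Subjects: Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence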
Online Access: https://hdl.handle.net/10356/136596
Institution: Nanyang Technological University
id sg-ntu-dr.10356-136596
record_format dspace
spelling sg-ntu-dr.10356-136596
2020-11-01T04:57:44Z
Human pose estimation and action recognition based on monocular video inputs
Leong, Mei Chee
Lee Yong Tsui
Lin Feng
Interdisciplinary Graduate School (IGS)
mytlee@ntu.edu.sg, asflin@ntu.edu.sg
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Doctor of Philosophy
2020-01-06T05:41:11Z
2020-01-06T05:41:11Z
2019
Thesis-Doctor of Philosophy
Leong, M. C. (2019). Human pose estimation and action recognition based on monocular video inputs. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/136596
10.32657/10356/136596
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description This thesis presents PhD research investigating the main processes in human motion analysis from a single video camera: 1) 3D human pose estimation, 2) motion reconstruction, and 3) action recognition. Motion capture and analysis is an active research area in computer vision with a wide range of applications. However, recovering human motion from monocular video is challenging because depth information is missing. Single-frame pose estimation and tracking methods struggle to recover from tracking failures and are prone to motion jitter. To address these limitations of per-frame methods, we propose to estimate a sequence of poses directly from a stack of consecutive frames. We use an example-based method with dense spatio-temporal features to find the best-matching poses and then interpolate between them to achieve smooth motion reconstruction. For the action recognition task, we use a learning-based method, specifically deep learning with a Convolutional Neural Network (CNN), to learn spatio-temporal features that distinguish action classes in large video datasets. In the initial study, a number of experiments were conducted to evaluate the effectiveness of different configurations of our architecture, followed by an extended study on deeper models. Lastly, we developed a generalized architecture that fuses 1D, 2D and 3D convolution layers and can be adopted in existing CNN models while retaining the network's learning properties. Our empirical studies demonstrated the advantages of our architecture over the corresponding 3D CNN models: 1) a boost of 16–30% in prediction accuracy, 2) effective spatio-temporal learning, and 3) lower computational cost.
The future goal of this project is to link the main processes in this research into a full-pipeline human motion analysis system that can be applied in real-life settings such as healthcare or sports analysis.
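The claim of lower computational cost relative to 3D CNN models can be illustrated with a simple weight count. This record does not say how the thesis fuses its 1D, 2D and 3D layers, so the sketch below assumes one common factorization rather than the author's design: a k x k x k 3D convolution replaced by a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution. The function names and the `mid_ch` parameter are hypothetical.

```python
# Assumption: a generic spatial-then-temporal factorization, used only to show
# why mixing lower-dimensional convolutions can reduce cost. It is not taken
# from the thesis, whose exact layer fusion is not described in this record.

def conv_params(in_ch, out_ch, kernel):
    """Weight count of a bias-free convolution with the given kernel shape."""
    size = 1
    for k in kernel:
        size *= k
    return in_ch * out_ch * size


def full_3d(in_ch, out_ch, k=3):
    """Weights in a single k x k x k 3D convolution."""
    return conv_params(in_ch, out_ch, (k, k, k))


def factorized(in_ch, out_ch, k=3, mid_ch=None):
    """Weights in a 1 x k x k spatial conv followed by a k x 1 x 1 temporal conv."""
    mid_ch = out_ch if mid_ch is None else mid_ch  # hypothetical intermediate width
    spatial = conv_params(in_ch, mid_ch, (1, k, k))
    temporal = conv_params(mid_ch, out_ch, (k, 1, 1))
    return spatial + temporal


if __name__ == "__main__":
    print(full_3d(64, 64), factorized(64, 64))
```

With 64 input and 64 output channels and k = 3, the full 3D layer has 110,592 weights and the factorized pair has 49,152, under the assumptions stated above.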
author2 Lee Yong Tsui
format Thesis-Doctor of Philosophy
author Leong, Mei Chee
title Human pose estimation and action recognition based on monocular video inputs
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/136596
_version_ 1683494056672362496