Vision-based 3D human and hand pose analysis

Bibliographic Details
Main Author: Cai, Yujun
Other Authors: Cham Tat Jen
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:https://hdl.handle.net/10356/153319
Institution: Nanyang Technological University
Collection: DR-NTU (NTU Library)
Description:
Vision-based 3D human and hand pose analysis has been a fast-growing research area that has attracted long-standing attention over the past decades, since it plays a significant role in numerous applications such as human-computer interaction, robotics, and gesture recognition. Despite the great progress in this field, it remains challenging to obtain accurate 3D pose estimates, predict future motions, and synthesize realistic human behaviors, due to the physical complexity of human and hand motion and the lack of high-quality datasets. To address these issues, this thesis investigates these tasks in four chapters.

For 3D pose estimation, I focus on two important aspects: how to alleviate the burden of 3D annotations, and how to better exploit the spatial-temporal correlations of human and hand structure. For the first aspect, unlike existing learning-based monocular RGB approaches that require accurate 3D annotations for training, I propose to leverage depth images, which can be easily obtained from commodity RGB-D cameras, during training, while at test time only RGB inputs are used for 3D joint prediction. In this way, the burden of costly 3D annotations is alleviated for real-world datasets. For the second aspect, motivated by the effectiveness of incorporating spatial dependencies and temporal consistency, a novel graph-based method is proposed to tackle 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections. Domain knowledge about hand (body) configurations is explicitly incorporated into the graph convolutional operations to meet the specific demands of 3D pose estimation.

For 3D motion prediction, I aim to capture the complicated structure and motion patterns of human behaviors. Specifically, a transformer-based architecture is applied to simultaneously capture long-range temporal correlations and spatial dependencies. To exploit the kinematic chains of body skeletons, a progressive strategy is deployed that explicitly decomposes future joint motion prediction into progressive steps, performed in a central-to-peripheral manner according to the structural connectivity. To further enable a generalized, full-spectrum human motion space across all videos in the training data, a memory-based dictionary is proposed to provide auxiliary information that enhances prediction quality.

For 3D motion synthesis, I aim to find a unified architecture for various 3D motion synthesis tasks, since most existing methods are either restricted to one type of motion synthesis or use different approaches for different tasks. In particular, I propose a framework based on the Conditional Variational Auto-Encoder (CVAE), in which any arbitrary input is treated as a masked motion series. To further allow flexible manipulation of the motion style of the generated series, an Action-Adaptive Modulation (AAM) scheme is designed to propagate the given semantic guidance through the whole sequence.

To summarize, this thesis focuses on 3D human and hand pose analysis for images and videos. Novel neural networks are developed to improve 3D pose estimation accuracy in an end-to-end manner. In addition, a motion prediction strategy and a unified motion synthesis model are proposed, which contribute significantly to human motion tracking and complex human gesture animation.
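The graph-based pose estimation described above operates on the skeleton as a graph, with joints as nodes and bones as edges. As a minimal sketch of that idea, the snippet below runs one graph-convolution step over a hypothetical 5-joint chain; the edge list, feature sizes, and random weights are illustrative stand-ins, not the thesis's actual skeleton graphs or learned parameters.

```python
import numpy as np

# Hypothetical 5-joint kinematic chain (joint 0 = root); the thesis's real
# skeleton topology and learned adjacency are not reproduced here.
EDGES = [(0, 1), (1, 2), (0, 3), (3, 4)]
NUM_JOINTS = 5

def normalized_adjacency(edges, n):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    a = np.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return d_inv_sqrt @ a @ d_inv_sqrt

def gcn_layer(x, adj, w):
    """One graph-convolution step: aggregate neighbor features, then project."""
    return np.maximum(adj @ x @ w, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=(NUM_JOINTS, 2))  # 2D joint detections as input features
w = rng.normal(size=(2, 16))          # learnable projection (random stand-in)
adj = normalized_adjacency(EDGES, NUM_JOINTS)
out = gcn_layer(x, adj, w)
print(out.shape)  # (5, 16)
```

Encoding the skeleton in the adjacency matrix is how domain knowledge about body or hand connectivity enters the convolution: each joint's updated feature mixes only its own and its neighbors' detections.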
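The "central-to-peripheral" progressive strategy for motion prediction orders joints outward along the kinematic chains from the torso. One simple way to realize such an ordering, sketched here on a hypothetical 6-joint tree (the parent map is illustrative, not the thesis's skeleton), is a breadth-first traversal from the root:

```python
from collections import deque

# Hypothetical kinematic tree: child joint -> parent joint, joint 0 is the root.
PARENTS = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}

def central_to_peripheral_order(parents, root=0):
    """Breadth-first traversal: central joints first, limb extremities last."""
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)
    order, queue = [], deque([root])
    while queue:
        joint = queue.popleft()
        order.append(joint)
        queue.extend(sorted(children.get(joint, [])))
    return order

print(central_to_peripheral_order(PARENTS))  # [0, 1, 4, 2, 5, 3]
```

A progressive predictor could then emit each joint's future motion in this order, conditioning peripheral joints on the already-predicted central ones.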
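The unified synthesis framework treats any conditioning input as a masked motion series, so different tasks become different binary masks over one motion tensor. The sketch below illustrates that framing only; the task names, shapes, and zero-fill convention are assumptions for this example, not the thesis's CVAE implementation.

```python
import numpy as np

T, D = 8, 6  # illustrative sequence length and per-frame pose dimension

def make_mask(task, t=T):
    """1.0 marks observed frames, 0.0 marks frames the model must generate."""
    m = np.zeros((t, 1))
    if task == "prediction":        # condition on a past prefix
        m[: t // 2] = 1.0
    elif task == "in-betweening":   # condition on both endpoint frames
        m[0] = m[-1] = 1.0
    elif task == "unconditioned":   # pure generation: nothing observed
        pass
    return m

def masked_condition(motion, mask):
    """Observed frames pass through; masked frames are zeroed for the encoder."""
    return motion * mask

motion = np.arange(T * D, dtype=float).reshape(T, D)
cond = masked_condition(motion, make_mask("in-betweening"))
print(cond[1].sum())  # interior frames are masked out -> 0.0
```

Under this view, a single conditional generator handles prediction, in-betweening, and unconditioned synthesis by swapping masks rather than architectures.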
Other Author: Cham Tat Jen (ASTJCham@ntu.edu.sg)
School: Interdisciplinary Graduate School (IGS); Institute for Media Innovation (IMI)
Degree: Doctor of Philosophy
Citation: Cai, Y. (2021). Vision-based 3D human and hand pose analysis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153319
DOI: 10.32657/10356/153319
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).