Vision-based 3D human and hand pose analysis

Bibliographic Details
Main Author: Cai, Yujun
Other Authors: Cham Tat Jen
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:https://hdl.handle.net/10356/153319
Institution: Nanyang Technological University
Collection: DR-NTU (NTU Library)
Description:
Vision-based 3D human and hand pose analysis has been a fast-growing research area that has attracted long-standing attention over the past decades, since it plays a significant role in numerous applications such as human-computer interaction, robotics, and gesture recognition. Despite the great progress in this field, it remains challenging to obtain accurate 3D pose estimates, predict future motions, and synthesize realistic human behaviors, due to the physical complexity of human and hand motion and the lack of high-quality datasets. To address these issues, this thesis investigates these tasks in four chapters.

For 3D pose estimation, I focus on two important aspects: how to alleviate the burden of 3D annotations, and how to better exploit the spatial-temporal correlations of human and hand structure. For the first aspect, unlike existing learning-based monocular RGB approaches that require accurate 3D annotations for training, I propose to leverage depth images, which can be easily obtained from commodity RGB-D cameras, during training, while at test time only RGB inputs are used for 3D joint prediction. In this way, the burden of costly 3D annotations is alleviated for real-world datasets. For the second aspect, motivated by the effectiveness of incorporating spatial dependencies and temporal consistency, a novel graph-based method is proposed to tackle 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections. Domain knowledge about hand (body) configurations is explicitly incorporated into the graph convolutional operations to meet the specific demands of 3D pose estimation.

For 3D motion prediction, I aim to capture the complicated structure and motion patterns of human behaviors. Specifically, a transformer-based architecture is applied to simultaneously capture long-range temporal correlations and spatial dependencies. To exploit the kinematic chains of body skeletons, a progressive strategy is deployed that explicitly decomposes future joint motion prediction into progressive steps, performed in a central-to-peripheral manner according to the structural connectivity. To further enable a generalized, full-spectrum human motion space across all videos in the training data, a memory-based dictionary is proposed to provide auxiliary information that enhances prediction quality.

For 3D motion synthesis, I aim to find a unified architecture for various 3D motion synthesis tasks, since most existing methods are either restricted to one type of motion synthesis or use different approaches for different tasks. In particular, I propose a framework based on the Conditional Variational Auto-Encoder (CVAE), in which any arbitrary input is treated as a masked motion series. To further allow flexible manipulation of the motion style of the generated series, an Action-Adaptive Modulation (AAM) scheme is designed to propagate the given semantic guidance through the whole sequence.

To summarize, this thesis focuses on 3D human and hand pose analysis for images and videos. Novel neural networks are developed to improve 3D pose estimation accuracy in an end-to-end manner. In addition, a motion prediction strategy and a unified motion synthesis model are proposed, which contribute significantly to human motion tracking and complex human gesture animation.
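The graph-based pose estimation described above operates on the skeleton as a graph, with joints as nodes and bones as edges. As a minimal sketch of that idea, the snippet below runs one graph-convolution step over a hypothetical 5-joint chain; the edge list, feature sizes, and random weights are illustrative stand-ins, not the thesis's actual skeleton graphs or learned parameters.

```python
import numpy as np

# Hypothetical 5-joint kinematic chain (joint 0 = root); the thesis's real
# skeleton topology and learned adjacency are not reproduced here.
EDGES = [(0, 1), (1, 2), (0, 3), (3, 4)]
NUM_JOINTS = 5

def normalized_adjacency(edges, n):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    a = np.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return d_inv_sqrt @ a @ d_inv_sqrt

def gcn_layer(x, adj, w):
    """One graph-convolution step: aggregate neighbor features, then project."""
    return np.maximum(adj @ x @ w, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=(NUM_JOINTS, 2))  # 2D joint detections as input features
w = rng.normal(size=(2, 16))          # learnable projection (random stand-in)
adj = normalized_adjacency(EDGES, NUM_JOINTS)
out = gcn_layer(x, adj, w)
print(out.shape)  # (5, 16)
```

Encoding the skeleton in the adjacency matrix is how domain knowledge about body or hand connectivity enters the convolution: each joint's updated feature mixes only its own and its neighbors' detections.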
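The "central-to-peripheral" progressive strategy for motion prediction orders joints outward along the kinematic chains from the torso. One simple way to realize such an ordering, sketched here on a hypothetical 6-joint tree (the parent map is illustrative, not the thesis's skeleton), is a breadth-first traversal from the root:

```python
from collections import deque

# Hypothetical kinematic tree: child joint -> parent joint, joint 0 is the root.
PARENTS = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}

def central_to_peripheral_order(parents, root=0):
    """Breadth-first traversal: central joints first, limb extremities last."""
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)
    order, queue = [], deque([root])
    while queue:
        joint = queue.popleft()
        order.append(joint)
        queue.extend(sorted(children.get(joint, [])))
    return order

print(central_to_peripheral_order(PARENTS))  # [0, 1, 4, 2, 5, 3]
```

A progressive predictor could then emit each joint's future motion in this order, conditioning peripheral joints on the already-predicted central ones.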
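The unified synthesis framework treats any conditioning input as a masked motion series, so different tasks become different binary masks over one motion tensor. The sketch below illustrates that framing only; the task names, shapes, and zero-fill convention are assumptions for this example, not the thesis's CVAE implementation.

```python
import numpy as np

T, D = 8, 6  # illustrative sequence length and per-frame pose dimension

def make_mask(task, t=T):
    """1.0 marks observed frames, 0.0 marks frames the model must generate."""
    m = np.zeros((t, 1))
    if task == "prediction":        # condition on a past prefix
        m[: t // 2] = 1.0
    elif task == "in-betweening":   # condition on both endpoint frames
        m[0] = m[-1] = 1.0
    elif task == "unconditioned":   # pure generation: nothing observed
        pass
    return m

def masked_condition(motion, mask):
    """Observed frames pass through; masked frames are zeroed for the encoder."""
    return motion * mask

motion = np.arange(T * D, dtype=float).reshape(T, D)
cond = masked_condition(motion, make_mask("in-betweening"))
print(cond[1].sum())  # interior frames are masked out -> 0.0
```

Under this view, a single conditional generator handles prediction, in-betweening, and unconditioned synthesis by swapping masks rather than architectures.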
Other Author: Cham Tat Jen (ASTJCham@ntu.edu.sg)
School: Interdisciplinary Graduate School (IGS); Institute for Media Innovation (IMI)
Degree: Doctor of Philosophy
Citation: Cai, Y. (2021). Vision-based 3D human and hand pose analysis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153319
DOI: 10.32657/10356/153319
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).