Vision-based 3D human and hand pose analysis
Main Author: Cai, Yujun
Other Authors: Cham Tat Jen
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/153319
Institution: Nanyang Technological University
Description:
Vision-based 3D human and hand pose analysis has been a fast-growing research area and has attracted long-standing research attention over the past decades, since it plays a significant role in numerous applications such as human-computer interaction, robotics, and gesture recognition. Despite the great progress in this field, it is still challenging to obtain accurate 3D pose estimates, predict future motions, and synthesize realistic human behaviors, due to the physical complexity of human and hand motion and the lack of high-quality datasets. To address these issues, this thesis presents four chapters investigating these tasks.
For 3D pose estimation, I mainly focus on two important aspects: how to alleviate the burden of 3D annotations, and how to better exploit the spatial-temporal correlations of human and hand structures. For the first aspect, unlike existing learning-based monocular RGB-input approaches that require accurate 3D annotations for training, I propose to leverage depth images, which can be easily obtained from commodity RGB-D cameras, during training, while during testing only RGB inputs are used for 3D joint prediction. In this way, the burden of costly 3D annotations is alleviated for real-world datasets. For the second aspect, motivated by the effectiveness of incorporating spatial dependencies and temporal consistencies, a novel graph-based method is proposed to tackle 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections. Domain knowledge about human hand (body) configurations is explicitly incorporated into the graph convolutional operations to meet the specific demands of 3D pose estimation.
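To make the graph-based idea concrete, the following is a minimal illustrative sketch (not the thesis's exact formulation): one graph-convolution step over a toy 5-joint skeleton, where the adjacency matrix encodes the kinematic connectivity, i.e. the kind of structural domain knowledge the abstract refers to. All shapes, joints, and weights here are made up for illustration.

```python
import numpy as np

# Toy skeleton: joint 0 is the root; edges follow simple kinematic chains.
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
num_joints, in_dim, out_dim = 5, 2, 3  # 2D joint detections -> 3D features

A = np.eye(num_joints)  # self-loops keep each joint's own features
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalization D^{-1/2} A D^{-1/2}, as in standard GCNs.
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(0)
X = rng.normal(size=(num_joints, in_dim))   # per-joint 2D input features
W = rng.normal(size=(in_dim, out_dim))      # learnable weights (random here)

# One GCN layer: each joint aggregates features from its kinematic neighbors.
H = A_norm @ X @ W
print(H.shape)  # (5, 3)
```

Because `A` only links physically connected joints, each layer mixes information strictly along the skeleton, which is what restricts the learned correlations to anatomically plausible ones.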
For 3D motion prediction, I aim to capture the complicated structures and motion patterns of human behaviors. Specifically, a transformer-based architecture is applied to simultaneously capture long-range temporal correlations and spatial dependencies. To exploit the kinematic chains of body skeletons, a progressive strategy is deployed that explicitly decomposes future joint motion prediction into progressive steps, performed in a central-to-peripheral manner according to the structural connectivity. To further enable a generalized, full-spectrum human motion space across all videos in the training data, a memory-based dictionary is proposed to provide auxiliary information that enhances prediction quality.
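The central-to-peripheral ordering can be sketched as a breadth-first traversal of the kinematic tree: joints near the body center are predicted first, and each peripheral joint is then conditioned on its already-predicted parent. The joint names and topology below are hypothetical, chosen only to illustrate the ordering.

```python
from collections import deque

# Hypothetical kinematic tree: parent -> children along the skeleton.
skeleton = {
    "pelvis": ["spine", "l_hip", "r_hip"],
    "spine": ["head", "l_shoulder", "r_shoulder"],
    "l_shoulder": ["l_elbow"], "l_elbow": ["l_wrist"],
    "r_shoulder": ["r_elbow"], "r_elbow": ["r_wrist"],
    "l_hip": ["l_knee"], "l_knee": ["l_ankle"],
    "r_hip": ["r_knee"], "r_knee": ["r_ankle"],
}

def progressive_order(root="pelvis"):
    """Breadth-first order: central joints first, extremities last."""
    order, queue = [], deque([root])
    while queue:
        joint = queue.popleft()
        order.append(joint)
        queue.extend(skeleton.get(joint, []))
    return order

# Pelvis comes first; wrists and ankles are predicted in the final steps.
print(progressive_order())
```

Each step of such a decomposition can then condition a peripheral joint's predicted motion on the freshly predicted motion of its parent, which is the intuition behind exploiting kinematic chains.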
For 3D motion synthesis, I aim to find a unified architecture for various 3D motion synthesis tasks, since most existing methods are either restricted to one type of motion synthesis or use different approaches for different tasks. In particular, I propose a framework based on the Conditional Variational Auto-Encoder (CVAE), in which any arbitrary input is treated as a masked motion series. To further allow flexible manipulation of the motion style of the generated series, an Action-Adaptive Modulation (AAM) scheme is designed to propagate the given semantic guidance through the whole sequence.
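The "masked motion series" view can be illustrated as follows: different synthesis tasks reduce to different binary masks over one sequence of frames, so a single conditional model can serve them all. This is a simplified sketch with hypothetical shapes and task names; the CVAE itself is omitted.

```python
import numpy as np

T, J, D = 8, 5, 3                         # frames, joints, coordinate dims
motion = np.arange(T * J * D, dtype=float).reshape(T, J, D)

def task_mask(task, T):
    """True = frame is observed and used as conditioning input."""
    m = np.zeros(T, dtype=bool)
    if task == "prediction":              # observe a prefix, synthesize the rest
        m[: T // 2] = True
    elif task == "in-betweening":         # observe endpoints, fill the middle
        m[0] = m[-1] = True
    elif task == "unconditional":         # nothing observed: free generation
        pass
    return m

m = task_mask("in-betweening", T)
conditioned = motion * m[:, None, None]   # masked series fed to the encoder
print(int(m.sum()))  # 2 observed frames
```

The appeal of this formulation is that prediction, in-betweening, and unconditional generation all share one encoder-decoder; only the mask changes between tasks.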
To summarize, this thesis focuses on 3D human and hand pose analysis for images and videos. Novel neural networks are developed to improve 3D pose estimation accuracy in an end-to-end manner. Meanwhile, a motion prediction strategy and a unified motion synthesis model are proposed, which contribute significantly to human motion tracking and complex human gesture animation.
Record: sg-ntu-dr.10356-153319 (last updated 2023-03-05)
Supervisor: Cham Tat Jen (ASTJCham@ntu.edu.sg)
Schools: Interdisciplinary Graduate School (IGS); Institute for Media Innovation (IMI)
Subject: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Degree: Doctor of Philosophy, awarded 2021 (deposited 2021-11-23)
Citation: Cai, Y. (2021). Vision-based 3D human and hand pose analysis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153319
DOI: 10.32657/10356/153319
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Format: application/pdf