Vision-based 3D human and hand pose analysis

Bibliographic Details
Main Author: Cai, Yujun
Other Authors: Cham Tat Jen
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/153319
Institution: Nanyang Technological University
Description
Summary: Vision-based 3D human and hand pose analysis has been a fast-growing research area that has attracted sustained attention over the past decades, since it plays a significant role in numerous applications such as human-computer interaction, robotics, and gesture recognition. Despite the great progress in this field, it remains challenging to obtain accurate 3D pose estimates, predict future motions, and synthesize realistic human behaviors, due to the physical complexity of human and hand motion and the lack of high-quality datasets. To address these issues, this thesis investigates these tasks in four chapters.

For 3D pose estimation, I focus on two important aspects: how to alleviate the burden of 3D annotations, and how to better exploit the spatio-temporal correlations of human and hand structure. For the first aspect, unlike existing learning-based monocular RGB approaches that require accurate 3D annotations for training, I propose to leverage depth images, which can be easily obtained from commodity RGB-D cameras, during training, while only RGB inputs are used for 3D joint prediction at test time. In this way, the burden of costly 3D annotations is alleviated for real-world datasets. For the second aspect, motivated by the effectiveness of incorporating spatial dependencies and temporal consistency, a novel graph-based method is proposed to tackle 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections. Domain knowledge about hand and body configurations is explicitly incorporated into the graph convolutional operations to meet the specific demands of 3D pose estimation.

For 3D motion prediction, I aim to capture complicated structures and explore the motion patterns of human behaviors. Specifically, a transformer-based architecture is applied to simultaneously capture long-range temporal correlations and spatial dependencies.
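As an illustration of how skeletal domain knowledge can enter a graph convolution, the following is a minimal sketch (not the thesis's actual network): a single Kipf-Welling-style layer that propagates 2D joint features along a hypothetical 5-joint kinematic chain encoded in the adjacency matrix. All names, the toy skeleton, and the layer design are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 5-joint kinematic chain (e.g. wrist -> 4 finger joints),
# purely illustrative; a real hand skeleton has 21 joints.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
NUM_JOINTS = 5

def normalized_adjacency(edges, n):
    """Symmetrically normalized adjacency with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    A = np.eye(n)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def graph_conv(X, A_hat, W):
    """One graph-convolution layer: mix each joint's features with its
    skeletal neighbors, then apply a shared linear map and ReLU."""
    return np.maximum(A_hat @ X @ W, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((NUM_JOINTS, 2))   # 2D joint detections
W = rng.standard_normal((2, 3))            # lift features toward 3D
A_hat = normalized_adjacency(EDGES, NUM_JOINTS)
Y = graph_conv(X, A_hat, W)
print(Y.shape)  # (5, 3)
```

The key point is that the skeleton's connectivity is baked into A_hat, so information flows only between physically linked joints; stacking such layers lets evidence propagate along the kinematic chain.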
To exploit the kinematic chains of body skeletons, a progressive strategy is deployed that explicitly decomposes future joint motion prediction into progressive steps, performed in a central-to-peripheral manner according to the structural connectivity. To further enable a generalized, full-spectrum human motion space across all videos in the training data, a memory-based dictionary is proposed to provide auxiliary information that enhances prediction quality.

For 3D motion synthesis, I aim to find a unified architecture for various 3D motion synthesis tasks, since most existing methods are either restricted to one type of motion synthesis or use different approaches for different tasks. In particular, I propose a framework based on the Conditional Variational Auto-Encoder (CVAE), in which any arbitrary input is treated as a masked motion series. To further allow flexible manipulation of the motion style of the generated series, an Action-Adaptive Modulation (AAM) is designed to propagate the given semantic guidance through the whole sequence.

To summarize, this thesis focuses on 3D human and hand pose analysis for images and videos. Novel neural networks are developed to improve 3D pose estimation accuracy in an end-to-end manner. In addition, a motion prediction strategy and a unified motion synthesis model are proposed, which contribute significantly to human motion tracking and complex human gesture animation.
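The "any input as a masked motion series" idea can be sketched as follows: different synthesis tasks (prediction, in-betweening, completion) differ only in which frames are observed, so each can be encoded as the observed frames plus a binary mask channel. This is a minimal illustration of the conditioning format, not the thesis's actual CVAE; the function name and dimensions are assumptions.

```python
import numpy as np

def make_masked_condition(motion, observed_idx):
    """Encode an arbitrary conditioning signal as a masked motion series:
    observed frames keep their values, missing frames are zeroed, and a
    binary mask channel tells the model which frames were given."""
    T, D = motion.shape
    mask = np.zeros((T, 1))
    mask[observed_idx] = 1.0
    return np.concatenate([motion * mask, mask], axis=1)  # shape (T, D+1)

rng = np.random.default_rng(0)
motion = rng.standard_normal((8, 4))  # 8 frames, 4-D pose features

# Motion prediction: the first 3 frames are observed.
pred_cond = make_masked_condition(motion, [0, 1, 2])
# Motion in-betweening: only the two endpoint frames are observed.
inb_cond = make_masked_condition(motion, [0, 7])
print(pred_cond.shape)  # (8, 5)
```

Because every task reduces to the same (motion, mask) representation, a single conditional generative model can be trained once and reused across synthesis settings by changing only the mask.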