Vision-based hand pose estimation and gesture recognition

Bibliographic Details
Main Author: Liang, Hui
Other Authors: Daniel Thalmann
Format: Theses and Dissertations
Language: English
Published: 2015
Online Access: https://hdl.handle.net/10356/65842
Institution: Nanyang Technological University
Description
Summary: Real-time hand pose estimation and gesture recognition from visual input are important for human-computer interaction. Compared with specialized hardware for this task, such as data gloves, vision-based methods are much cheaper and can provide a more natural, non-intrusive interaction experience. Despite previous work in this field, the problem remains challenging because of the high flexibility and shape variation of the articulated hand.

This thesis focuses on inferring the full degree-of-freedom hand pose and semantic hand gestures from visual image input. An RGB-D camera is selected as the input device for the proposed methods, since it captures more detailed 3D structure of the hand and is therefore less ambiguous for hand pose and gesture inference. To develop a practical solution, several related aspects have been investigated, such as hand part recognition, fingertip tracking and hand skeleton extraction. Overall, this thesis makes the following contributions:

1. A hand parsing scheme to extract the hand parts from depth images. It consists of a novel Superpixel-Markov Random Field that enforces both spatial smoothness and the co-occurrence priors of the hand part labels to improve per-pixel classification results, and it proves superior to pixel-level filtering. In addition, the method generalizes well to human body parsing. In the follow-up work on hand pose estimation, the parsed hand parts prove effective for discriminative pose prediction by enforcing the correlations among the pose parameters. A hand gesture recognition method is also proposed that takes advantage of the parsed hand parts and achieves high accuracy and robustness to hand rotation.

2. A model-based framework for hand pose estimation from a continuous depth image sequence. It adopts a divide-and-conquer scheme and combines fingertip tracking with an articulated iterative closest point (ICP) approach to recover the hand pose. To track the fingertips robustly, we propose several novel depth features that differentiate fingertip from non-fingertip points, and we use a particle filter to track the fingertips through successive frames. Compared to previous methods, our proposed method can accurately locate each of the five fingertips in relatively complex hand configurations. For hand pose inference, the tracked fingertip positions provide an initial estimate, and an articulated ICP algorithm is used for further refinement.

3. A discriminative framework to predict the 3D hand joint positions from a single depth image, which addresses the self-occlusion issue encountered in our model-based framework. It enforces the hand part correlations to improve regression-forest-based methods in two ways. First, the hand parts are used as an additional cue for regression. Second, a Multi-modal Prediction Fusion algorithm is proposed to fuse the ambiguous per-pixel predictions within the low-dimensional hand pose manifold. This method improves prediction accuracy considerably compared to the competing methods and is especially effective in handling the discrepancies between synthesized training data and real-world inputs.

The proposed approaches are all capable of running in real time or near real time. To further exploit their potential in human-computer interaction, various applications are developed, such as hand-based communication with a virtual avatar and virtual object manipulation.
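
Illustrative note: the summary states that a particle filter tracks the fingertips through successive frames but gives no implementation details. The following is a minimal sketch of a generic bootstrap particle filter for a single 3D fingertip position, not the method of the thesis. The function names, the random-walk motion model, the Gaussian observation likelihood and the default noise values are all assumptions made for illustration; the thesis's depth features are not modelled here.

```python
import math
import random


def particle_filter_step(particles, observation,
                         motion_std=2.0, obs_std=5.0, rng=random):
    """One predict / weight / resample step of a bootstrap particle filter.

    particles   -- list of (x, y, z) fingertip hypotheses (assumed units: mm)
    observation -- (x, y, z) fingertip detection for the current frame
    motion_std, obs_std -- hypothetical noise levels, not taken from the thesis
    """
    # Predict: diffuse every hypothesis with a random-walk motion model.
    predicted = [tuple(c + rng.gauss(0.0, motion_std) for c in p)
                 for p in particles]

    # Weight: Gaussian likelihood of the observation given each hypothesis.
    weights = []
    for p in predicted:
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, observation))
        weights.append(math.exp(-0.5 * sq_dist / obs_std ** 2))

    # If every weight underflowed to zero, fall back to uniform weights.
    if sum(weights) == 0.0:
        weights = [1.0] * len(predicted)

    # Resample: draw hypotheses with replacement, proportional to weight.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate_position(particles):
    """Point estimate of the fingertip: the mean of all particles."""
    n = len(particles)
    return tuple(sum(p[i] for p in particles) / n for i in range(3))
```

In use, `particle_filter_step` would be called once per depth frame with that frame's fingertip detection, and `estimate_position` would give the tracked position. A tracker for all five fingertips would need one filter per finger plus some data association between detections and fingers, which this sketch leaves out.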