Vision-based hand pose estimation and gesture recognition
Main Author: Liang, Hui
Other Authors: Daniel Thalmann
Format: Theses and Dissertations
Language: English
Published: 2015
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Online Access: https://hdl.handle.net/10356/65842
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-65842
record_format: dspace
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
description:
Real-time hand pose estimation and gesture recognition from visual inputs are important for human-computer interaction. Compared to specialized hardware built for this task, e.g., data gloves, vision-based methods are much cheaper and provide a more natural, non-intrusive interaction experience. Despite previous work in this field, the problem remains challenging due to the high flexibility and shape variation of the articulated hand. This thesis focuses on inferring the full degree-of-freedom hand pose and semantic hand gestures from visual image inputs. In particular, an RGB-D camera is chosen as the input device to record the hand images for the proposed methods, since it captures the 3D structure of the hand in more detail and is thus less ambiguous for hand pose and gesture inference. To develop a practical solution, several related aspects are investigated, such as hand part recognition, fingertip tracking and hand skeleton extraction. Overall, this thesis makes the following contributions:

1. A hand parsing scheme to extract hand parts from depth images. It is built on a novel Superpixel-Markov Random Field that enforces both spatial smoothness and co-occurrence priors on the hand part labels to improve per-pixel classification, and it proves superior to pixel-level filtering. The method also generalizes well to human body parsing. In follow-up work on hand pose estimation, the parsed hand parts prove effective for discriminative pose prediction by enforcing correlations among the pose parameters. A hand gesture recognition method that takes advantage of the parsed hand parts achieves high accuracy and robustness to hand rotation.

2. A model-based framework for hand pose estimation from a continuous depth image sequence. It adopts a divide-and-conquer scheme that combines fingertip tracking with an articulated iterative closest point (ICP) approach to recover the hand pose. To track the fingertips robustly, several novel depth features are proposed to differentiate fingertip points from non-fingertip points, and a particle filter tracks the fingertips through successive frames (a generic sketch of this tracking loop follows the list). Compared to previous methods, the proposed method accurately locates each of the five fingertips even for relatively complex hand configurations. For hand pose inference, the tracked fingertip positions provide an initial estimate, which an articulated ICP algorithm then refines.

3. A discriminative framework to predict 3D hand joint positions from a single depth image, which addresses the self-occlusion issue encountered in the model-based framework. It enforces hand part correlations to improve regression-forest-based methods in two ways. First, the hand parts serve as an additional cue for regression. Second, a Multi-modal Prediction Fusion algorithm fuses the ambiguous per-pixel predictions within a low-dimensional hand pose manifold. This method improves prediction accuracy considerably over competing methods and is especially effective in handling the discrepancies between synthesized training data and real-world inputs.
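The particle-filter fingertip tracking in contribution 2 follows the standard predict-weight-resample loop. As a rough illustration only, here is a minimal Python sketch of that generic loop for a single fingertip in a depth sequence; the `fingertip_likelihood` heuristic, the random-walk motion model, and all parameters are simplifying assumptions for exposition, not the thesis's actual depth features or implementation.

```python
import numpy as np

def fingertip_likelihood(depth, pt, sigma=10.0):
    """Toy likelihood: a fingertip protrudes toward the camera, so its depth
    should be smaller than the average depth of its neighborhood. This is a
    placeholder for the thesis's learned depth features."""
    x, y = int(round(pt[0])), int(round(pt[1]))
    h, w = depth.shape
    if not (2 <= x < w - 2 and 2 <= y < h - 2):
        return 1e-6  # particles outside the image get negligible weight
    protrusion = depth[y - 2:y + 3, x - 2:x + 3].mean() - depth[y, x]
    # Peaks when the point protrudes roughly `sigma` depth units.
    return np.exp(-((protrusion - sigma) ** 2) / (2.0 * sigma ** 2))

def track_fingertip(depth_frames, init_xy, n_particles=200, motion_std=3.0, seed=0):
    """Track one fingertip through a depth sequence with a particle filter:
    predict with a random-walk motion model, weight by the depth likelihood,
    estimate by the weighted particle mean, then resample."""
    rng = np.random.default_rng(seed)
    particles = init_xy + rng.normal(0.0, motion_std, size=(n_particles, 2))
    trajectory = []
    for depth in depth_frames:
        # Predict: diffuse particles with the motion model.
        particles += rng.normal(0.0, motion_std, size=particles.shape)
        # Weight: score each particle by how fingertip-like its location is.
        weights = np.array([fingertip_likelihood(depth, p) for p in particles])
        weights = (weights + 1e-12) / (weights + 1e-12).sum()
        # Estimate: posterior-mean fingertip position for this frame.
        trajectory.append((weights[:, None] * particles).sum(axis=0))
        # Resample: concentrate particles on high-likelihood locations.
        particles = particles[rng.choice(n_particles, size=n_particles, p=weights)]
    return trajectory
```

With real input, `depth_frames` would be a sequence of HxW depth arrays from the RGB-D camera and `init_xy` a detected fingertip pixel; per the abstract, the thesis tracks all five fingertips and uses learned depth features rather than this hand-crafted heuristic.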
All of the proposed approaches run in real time or near real time. To further exploit their potential in human-computer interaction, various applications are developed, such as hand-based communication with a virtual avatar and virtual object manipulation.
author2: Daniel Thalmann
format: Theses and Dissertations
author: Liang, Hui
author_sort: Liang, Hui
title: Vision-based hand pose estimation and gesture recognition
publishDate: 2015
url: https://hdl.handle.net/10356/65842
_version_: 1772826230651879424
spelling: sg-ntu-dr.10356-65842 (record updated 2023-07-04T16:27:31Z)
Supervisors: Daniel Thalmann; Yuan Junsong
School: School of Electrical and Electronic Engineering
Degree: DOCTOR OF PHILOSOPHY (EEE)
Deposited: 2015-12-23T06:16:43Z
Type: Thesis
Citation: Liang, H. (2015). Vision-based hand pose estimation and gesture recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/65842
DOI: 10.32657/10356/65842
Language: en
Extent: 140 p.
Format: application/pdf