3D hand estimation under egocentric vision

Bibliographic Details
Main Author: Zhu, Yixiang
Other Authors: Yap Kim Hui
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University, 2025
Subjects:
Online Access:https://hdl.handle.net/10356/182401
Institution: Nanyang Technological University
Description
Summary: This dissertation introduces a novel two-stage transformer-based model for 3D hand pose estimation, specifically designed for egocentric conditions. The proposed architecture integrates a FastViT-MA36 backbone in the first stage, which efficiently extracts features from monocular RGB images. In the second stage, three transformer encoder layers refine pose accuracy by capturing essential spatial relationships between hand joints. This two-stage design combines effective feature extraction with contextual awareness, addressing challenges such as occlusion and partial visibility. The model significantly improves the accuracy of 3D hand pose estimation, achieving an area under the curve (AUC) of 0.87 on the FPHA dataset, compared with 0.76 for previous state-of-the-art methods. This gain demonstrates the effectiveness of the proposed architecture, with optimized feature extraction and transformer-based processing leading to substantial improvements in pose estimation accuracy. The model is also robust to occlusion, maintaining high accuracy under challenging self-occlusion and object-occlusion scenarios, and it achieves a real-time processing speed of over 200 frames per second (fps) on the FPHA dataset, making it a promising solution for high-precision, real-time hand pose estimation in practical scenarios.
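
To make the two-stage design more concrete, the following minimal PyTorch sketch mirrors the structure described in the abstract: a FastViT-MA36 backbone extracts image features, and three transformer encoder layers refine per-joint tokens before 3D coordinates are regressed. The joint count (21), embedding width, attention head count, and the use of the timm library to instantiate the backbone are illustrative assumptions, not details taken from the dissertation itself.

import torch
import torch.nn as nn
import timm


class TwoStageHandPoseNet(nn.Module):
    """Hypothetical sketch of the two-stage design described in the abstract:
    stage 1 extracts features with a FastViT-MA36 backbone, stage 2 refines
    per-joint tokens with three transformer encoder layers."""

    def __init__(self, num_joints=21, embed_dim=256):
        super().__init__()
        # Stage 1: FastViT-MA36 backbone (loaded via timm here; the
        # dissertation does not specify the implementation).
        self.backbone = timm.create_model(
            "fastvit_ma36", pretrained=False, num_classes=0)
        feat_dim = self.backbone.num_features
        # Project the pooled image feature into one token per hand joint.
        self.to_joint_tokens = nn.Linear(feat_dim, num_joints * embed_dim)
        # Stage 2: three transformer encoder layers model spatial
        # relationships between the joint tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, dim_feedforward=4 * embed_dim,
            batch_first=True)
        self.refiner = nn.TransformerEncoder(encoder_layer, num_layers=3)
        # Regress a 3D coordinate from each refined joint token.
        self.head = nn.Linear(embed_dim, 3)
        self.num_joints = num_joints
        self.embed_dim = embed_dim

    def forward(self, rgb):                    # rgb: (B, 3, H, W)
        feats = self.backbone(rgb)             # (B, feat_dim)
        tokens = self.to_joint_tokens(feats)   # (B, num_joints * embed_dim)
        tokens = tokens.view(-1, self.num_joints, self.embed_dim)
        tokens = self.refiner(tokens)          # (B, num_joints, embed_dim)
        return self.head(tokens)               # (B, num_joints, 3)


if __name__ == "__main__":
    model = TwoStageHandPoseNet()
    joints = model(torch.randn(1, 3, 256, 256))
    print(joints.shape)  # torch.Size([1, 21, 3])

This sketch only illustrates how a backbone-plus-encoder pipeline of the kind described above can map a monocular RGB frame to 21 three-dimensional joint positions; the dissertation's actual tokenization, losses, and training details are not reproduced here.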