Scaling up parametric human recovery

Bibliographic Details
Main Author: Cai, Zhongang
Other Authors: Liu Ziwei
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Online Access: https://hdl.handle.net/10356/178618
Institution: Nanyang Technological University
Description
Summary: This thesis addresses the fundamental tasks of parametric human recovery (i.e., human pose and shape estimation with parametric human models) from monocular images or videos, a field with wide-ranging applications. The key goal is to develop the first foundation model for human parametric pose and shape recovery. A critical challenge in achieving this is the prohibitive cost of acquiring large-scale, diverse datasets, which limits the accuracy and generalizability of state-of-the-art methods. To overcome this, the thesis presents a cohesive series of studies aimed at resolving data scarcity. These studies culminate in a comprehensive investigation that uses an extensive range of data to explore scaling laws in parametric human recovery. Collectively, these efforts mark a significant advancement in the field by scaling up toward unprecedentedly robust, high-performance parametric human recovery in the wild.
First, we develop a low-cost data collection facility for massive human data acquisition. Advances in sensors and algorithms enable the collection of paired data through inexpensive setups and automated annotation pipelines. HuMMan, a mega-scale multi-modal 4D human dataset, exemplifies this with its 1000 human subjects, 400K sequences, and 60M frames. HuMMan’s features include: 1) multimodal data and annotations, 2) integration of popular mobile devices in the sensor suite, 3) a comprehensive set of 500 actions, and 4) support for various tasks such as action recognition, pose estimation, and especially parametric human recovery. However, a subsequent study shows that, despite the diverse subjects and actions, HuMMan’s consistent background limits its generalizability.
Second, we introduce GTA-Human, a large-scale 3D human dataset derived from the popular video game GTA-V. This dataset is notable for its diversity in subjects, actions, and scenarios (which HuMMan lacks), obtained through gameplay with automatically annotated 3D ground truths. Our findings reveal that: 1) game-playing data is remarkably effective, 2) synthetic data provides essential supplements to real data, 3) the scale of the dataset is crucial, 4) strong supervision labels are key, and 5) synthetic data enhances the performance of larger models. However, GTA-Human’s diversity is ultimately constrained by the asset database available in the game.
Third, we develop SynBody, a synthetic dataset created using the Unreal Engine, where we have full control over the assets, to further diversify human models and enhance annotation quality. SynBody’s highlights include: 1) a clothed parametric human model generating diverse subjects, 2) a layered human representation for high-quality 3D annotations, and 3) a scalable system producing 1.2M frames of realistic data. Experiments on SynBody indicate substantial improvements in parametric human recovery tasks.
Finally, with the availability of massive datasets, we explore scaling up expressive human pose and shape estimation (EHPS) through SMPLer-X, a generalist foundation model. Using a Vision Transformer as the backbone and training on 4.5M instances from diverse datasets, SMPLer-X demonstrates robust performance and transferability. Our study on data scaling leads to an optimized training scheme, resulting in a significant leap in EHPS capabilities. We also use Vision Transformers of different sizes to examine the scaling law of model sizes in EHPS. Additionally, our fine-tuning strategy turns SMPLer-X into specialist models, further boosting performance and achieving state-of-the-art results on seven benchmarks, including AGORA, UBody, EgoBody, and EHF.
In conclusion, this thesis pioneers the development of the first foundation model for human parametric pose and shape recovery, addressing critical challenges in data acquisition and generalizability. By developing innovative datasets and exploring data and model scaling laws, the research enhances the accuracy and robustness of parametric human models. This work not only marks a significant advancement in the field but also sets a new standard for future research in human pose and shape estimation.