Scaling up parametric human recovery

Bibliographic Details
Main Author: Cai, Zhongang
Other Authors: Liu Ziwei
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access: https://hdl.handle.net/10356/178618
Institution: Nanyang Technological University
id sg-ntu-dr.10356-178618
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Human pose and shape estimation
Human parametric models
Datasets
Foundation models
description This thesis addresses the fundamental tasks of parametric human recovery (i.e., human pose and shape estimation with parametric human models) from monocular images or videos, a field with wide-ranging applications. The key goal is to develop the first foundation model for human parametric pose and shape recovery. A critical challenge in achieving this is the prohibitive cost of acquiring large-scale, diverse datasets, which limits the accuracy and generalizability of state-of-the-art methods. To overcome this, the thesis presents a cohesive series of innovative studies aimed at resolving data scarcity. These studies culminate in a comprehensive investigation that uses an extensive range of data to explore scaling laws in parametric human recovery. Collectively, these efforts mark a significant advance in the field by scaling up toward unprecedentedly robust, high-performance parametric human recovery in the wild.
First, we developed a low-cost data collection facility for massive human data acquisition. Advances in sensors and algorithms enable the collection of paired data through inexpensive setups and automated annotation pipelines. HuMMan, a mega-scale multi-modal 4D human dataset, exemplifies this with its 1000 human subjects, 400K sequences, and 60M frames. HuMMan’s features include: 1) multi-modal data and annotations, 2) integration of popular mobile devices in the sensor suite, 3) a comprehensive set of 500 actions, and 4) support for various tasks such as action recognition, pose estimation, and especially parametric human recovery. However, in a subsequent study, we discovered that, despite the diverse subjects and actions, HuMMan’s consistent background limits its generalizability.
Second, we introduce GTA-Human, a large-scale 3D human dataset derived from the popular video game GTA-V. This dataset is notable for its diversity in subjects, actions, and scenarios (the last of which HuMMan lacks), obtained through gameplay with automatically annotated 3D ground truths. Our findings reveal that: 1) game-playing data is remarkably effective, 2) synthetic data provides essential supplements to real data, 3) the scale of the dataset is crucial, 4) strong supervision labels are key, and 5) synthetic data enhances the performance of larger models. However, GTA-Human’s diversity is ultimately constrained by the asset database available in the game.
Third, we developed SynBody, a synthetic dataset created using Unreal Engine, where we have full control over the assets, to further diversify human models and enhance annotation quality. SynBody’s highlights include: 1) a clothed parametric human model generating diverse subjects, 2) a layered human representation for high-quality 3D annotations, and 3) a scalable system producing 1.2M frames of realistic data. Experiments on SynBody indicate substantial improvements in parametric human recovery tasks.
Finally, with the availability of massive datasets, we explore scaling up expressive human pose and shape estimation (EHPS) through SMPLer-X, a generalist foundation model. Utilizing a Vision Transformer as the backbone and training on 4.5M instances from diverse datasets, SMPLer-X demonstrates robust performance and transferability. Our study on data scaling leads to an optimized training scheme, resulting in a significant leap in EHPS capabilities. We leverage Vision Transformers to examine the scaling law of model sizes in EHPS. Additionally, our fine-tuning strategy evolves SMPLer-X into specialist models, further boosting performance and achieving state-of-the-art results on seven benchmarks, including AGORA, UBody, EgoBody, and EHF.
In conclusion, this thesis pioneers the development of the first foundation model for human parametric pose and shape recovery, addressing critical challenges in data acquisition and generalizability. By developing innovative datasets and exploring data and model scaling laws, the research enhances the accuracy and robustness of parametric human models. This work not only marks a significant advancement in the field but also sets a new standard for future research in human pose and shape estimation.
author2 Liu Ziwei
author_facet Liu Ziwei
Cai, Zhongang
format Thesis-Doctor of Philosophy
author Cai, Zhongang
author_sort Cai, Zhongang
title Scaling up parametric human recovery
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/178618
_version_ 1814047413707472896
spelling sg-ntu-dr.10356-178618 2024-08-01T08:11:46Z Scaling up parametric human recovery Cai, Zhongang Liu Ziwei College of Computing and Data Science ziwei.liu@ntu.edu.sg Computer and Information Science Human pose and shape estimation Human parametric models Datasets Foundation models Doctor of Philosophy 2024-07-01T07:21:54Z 2024-07-01T07:21:54Z 2024 Thesis-Doctor of Philosophy Cai, Z. (2024). Scaling up parametric human recovery. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178618 https://hdl.handle.net/10356/178618 10.32657/10356/178618 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University