Scaling up parametric human recovery

This thesis addresses the fundamental tasks of parametric human recovery (i.e. human pose and shape estimation with parametric human models) from monocular images or videos, a field with wide-ranging applications. The key goal is to develop the first foundation model for human parametric pose and sh...

Bibliographic Details
Main Author:	Cai, Zhongang
Other Authors:	Liu Ziwei
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science; Human pose and shape estimation; Human parametric models; Datasets; Foundation models
Online Access:	https://hdl.handle.net/10356/178618
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-178618
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Computer and Information Science; Human pose and shape estimation; Human parametric models; Datasets; Foundation models
spellingShingle	Computer and Information Science Human pose and shape estimation Human parametric models Datasets Foundation models Cai, Zhongang Scaling up parametric human recovery
description	This thesis addresses the fundamental tasks of parametric human recovery (i.e., human pose and shape estimation with parametric human models) from monocular images or videos, a field with wide-ranging applications. The key goal is to develop the first foundation model for human parametric pose and shape recovery. A critical challenge in achieving this is the prohibitive cost of acquiring large-scale, diverse datasets, which limits the accuracy and generalizability of state-of-the-art methods. To overcome this, the thesis presents a cohesive series of studies aimed at resolving data scarcity. These studies culminate in a comprehensive investigation that uses an extensive range of data to explore scaling laws in parametric human recovery. Collectively, these efforts advance the field by scaling up toward unprecedentedly robust, high-performance parametric human recovery in the wild.
First, we develop a low-cost data collection facility for massive human data acquisition. Advances in sensors and algorithms enable the collection of paired data through inexpensive setups and automated annotation pipelines. HuMMan, a mega-scale multi-modal 4D human dataset, exemplifies this with its 1000 human subjects, 400K sequences, and 60M frames. HuMMan’s features include: 1) multi-modal data and annotations, 2) integration of popular mobile devices in the sensor suite, 3) a comprehensive set of 500 actions, and 4) support for various tasks such as action recognition, pose estimation, and especially parametric human recovery. However, a subsequent study shows that, despite the diverse subjects and actions, HuMMan’s consistent background limits its generalizability.
Second, we introduce GTA-Human, a large-scale 3D human dataset derived from the popular video game GTA-V. This dataset is notable for its diversity in subjects, actions, and scenarios (the last of which HuMMan lacks), obtained through gameplay with automatically annotated 3D ground truths. Our findings reveal that: 1) game-playing data is remarkably effective, 2) synthetic data provides essential supplements to real data, 3) the scale of the dataset is crucial, 4) strong supervision labels are key, and 5) synthetic data enhances the performance of larger models. However, GTA-Human’s diversity is ultimately constrained by the asset database available in the game.
Third, we develop SynBody, a synthetic dataset created using the Unreal Engine, where we have full control over the assets, to further diversify human models and enhance annotation quality. SynBody’s highlights include: 1) a clothed parametric human model generating diverse subjects, 2) a layered human representation for high-quality 3D annotations, and 3) a scalable system producing 1.2M frames of realistic data. Experiments on SynBody indicate substantial improvements in parametric human recovery tasks.
Finally, with the availability of massive datasets, we explore scaling up expressive human pose and shape estimation (EHPS) through SMPLer-X, a generalist foundation model. Using a Vision Transformer as the backbone and training on 4.5M instances from diverse datasets, SMPLer-X demonstrates robust performance and transferability. Our study on data scaling leads to an optimized training scheme, resulting in a significant leap in EHPS capabilities. We leverage Vision Transformers to examine the scaling law of model sizes in EHPS. Additionally, our fine-tuning strategy evolves SMPLer-X into specialist models, further boosting performance and achieving state-of-the-art results on seven benchmarks, including AGORA, UBody, EgoBody, and EHF.
In conclusion, this thesis pioneers the development of the first foundation model for human parametric pose and shape recovery, addressing critical challenges in data acquisition and generalizability. By developing innovative datasets and exploring data and model scaling laws, the research enhances the accuracy and robustness of parametric human recovery. This work not only marks a significant advancement in the field but also sets a new standard for future research in human pose and shape estimation.
author2	Liu Ziwei
author_facet	Liu Ziwei Cai, Zhongang
format	Thesis-Doctor of Philosophy
author	Cai, Zhongang
author_sort	Cai, Zhongang
title	Scaling up parametric human recovery
title_short	Scaling up parametric human recovery
title_full	Scaling up parametric human recovery
title_fullStr	Scaling up parametric human recovery
title_full_unstemmed	Scaling up parametric human recovery
title_sort	scaling up parametric human recovery
publisher	Nanyang Technological University
publishDate	2024
url	https://hdl.handle.net/10356/178618
_version_	1814047413707472896
spelling	sg-ntu-dr.10356-178618 2024-08-01T08:11:46Z Scaling up parametric human recovery Cai, Zhongang Liu Ziwei College of Computing and Data Science ziwei.liu@ntu.edu.sg Computer and Information Science Human pose and shape estimation Human parametric models Datasets Foundation models Doctor of Philosophy 2024-07-01T07:21:54Z 2024-07-01T07:21:54Z 2024 Thesis-Doctor of Philosophy Cai, Z. (2024). Scaling up parametric human recovery. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178618 https://hdl.handle.net/10356/178618 10.32657/10356/178618 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
