Text2Human: text-driven controllable human image generation

Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should ideally be intuitively controllable for novice users.

Full description

Bibliographic Details
Main Authors: Jiang, Yuming; Yang, Shuai; Qiu, Haonan; Wu, Wayne; Loy, Chen Change; Liu, Ziwei
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Text-Driven Generation; Image Generation
Online Access: https://hdl.handle.net/10356/163319
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-163319
Record Format: dspace
Collection: DR-NTU (NTU Library)
Title: Text2Human: text-driven controllable human image generation
Authors: Jiang, Yuming; Yang, Shuai; Qiu, Haonan; Wu, Wayne; Loy, Chen Change; Liu, Ziwei
Affiliations: School of Computer Science and Engineering; S-Lab for Advanced Intelligence
Subjects: Engineering::Computer science and engineering; Text-Driven Generation; Image Generation
Description: Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should ideally be intuitively controllable for novice users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given texts describing the shapes of clothes, the human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with further attributes describing the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level encodes the structural representations of textures, while the codebook at the fine level focuses on the details of textures. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture-of-experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated to human images by a decoder learned jointly with the hierarchical codebooks. The mixture-of-experts allows the generated image to be conditioned on fine-grained text input, and the prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed Text2Human framework can generate more diverse and realistic human images compared to state-of-the-art methods. Our project page is https://yumingj.github.io/projects/Text2Human.html. Code and pretrained models are available at https://github.com/yumingj/Text2Human.
Funding: Ministry of Education (MOE). This study is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088), and by the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).
Grant Numbers: 2021-T1-001-088; IAF-ICP
Date Deposited: 2022-12-02
Type: Journal Article
Citation: Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C. C. & Liu, Z. (2022). Text2Human: text-driven controllable human image generation. ACM Transactions on Graphics, 41(4), 162. https://dx.doi.org/10.1145/3528223.3530104
Journal: ACM Transactions on Graphics, Volume 41, Issue 4, Article 162
ISSN: 0730-0301
DOI: 10.1145/3528223.3530104
Handle: https://hdl.handle.net/10356/163319
Scopus ID: 2-s2.0-85135161067
Language: English
Rights: © 2022 Association for Computing Machinery. All rights reserved.
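
The description above outlines a two-stage pipeline: a pose map plus clothing-shape text is first translated into a human parsing map, and the parsing map plus clothing-texture text then drives sampling from a hierarchical texture-aware codebook whose indices are decoded into the final image. The PyTorch sketch below is only meant to make that data flow concrete; every module, tensor shape, and parameter name (PoseToParsing, HierarchicalCodebookSampler, the attribute vectors) is a hypothetical stand-in rather than the authors' API, and a plain argmax stands in for the diffusion-based mixture-of-experts transformer sampler. The actual implementation is in the repository linked above.

```python
# Illustrative sketch of the two-stage Text2Human data flow described above.
# All module names, shapes, and hyperparameters are hypothetical placeholders;
# see https://github.com/yumingj/Text2Human for the authors' real code.
import torch
import torch.nn as nn

class PoseToParsing(nn.Module):
    """Stage 1 (sketch): pose map + clothing-shape attributes -> parsing map."""
    def __init__(self, num_shape_attrs=8, num_parsing_classes=24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + num_shape_attrs, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_parsing_classes, 3, padding=1),
        )

    def forward(self, pose, shape_attrs):
        # Broadcast the attribute vector over the spatial grid and fuse with the pose map.
        b, _, h, w = pose.shape
        attrs = shape_attrs.view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.net(torch.cat([pose, attrs], dim=1))  # per-pixel parsing logits

class HierarchicalCodebookSampler(nn.Module):
    """Stage 2 (sketch): sample coarse codebook indices from the parsing map and
    texture attributes, predict finer-level indices, and decode an image."""
    def __init__(self, num_codes=512, dim=64, num_parsing_classes=24, num_texture_attrs=8):
        super().__init__()
        self.coarse_codebook = nn.Embedding(num_codes, dim)
        self.fine_codebook = nn.Embedding(num_codes, dim)
        self.index_head = nn.Conv2d(num_parsing_classes + num_texture_attrs, num_codes, 1)
        self.coarse_to_fine = nn.Conv2d(dim, num_codes, 1)  # logits over finer-level codes
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, parsing_logits, texture_attrs):
        b, _, h, w = parsing_logits.shape
        attrs = texture_attrs.view(b, -1, 1, 1).expand(-1, -1, h, w)
        # In the paper this step is a diffusion-based transformer sampler with
        # mixture-of-experts; a plain argmax over logits stands in for it here.
        coarse_idx = self.index_head(torch.cat([parsing_logits, attrs], dim=1)).argmax(1)
        coarse_feat = self.coarse_codebook(coarse_idx).permute(0, 3, 1, 2)
        fine_idx = self.coarse_to_fine(coarse_feat).argmax(1)      # coarse -> fine indices
        fine_feat = self.fine_codebook(fine_idx).permute(0, 3, 1, 2)
        return self.decoder(coarse_feat + fine_feat)               # RGB image

if __name__ == "__main__":
    pose = torch.randn(1, 1, 64, 32)   # pose map (hypothetical encoding)
    shape_attrs = torch.rand(1, 8)     # attributes parsed from the clothing-shape text
    texture_attrs = torch.rand(1, 8)   # attributes parsed from the clothing-texture text
    parsing = PoseToParsing()(pose, shape_attrs)
    image = HierarchicalCodebookSampler()(parsing, texture_attrs)
    print(image.shape)                 # torch.Size([1, 3, 128, 64])
```

The point of the coarse-to-fine step in the sketch mirrors the abstract: the coarse codebook carries texture structure, while the predicted finer-level indices pull in the detail representations that refine clothing-texture quality.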