Freestyle layout-to-image synthesis

Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics (e.g., classes, attributes, and styles) onto a gi...

Bibliographic Details
Main Authors: XUE, Han; HUANG, Zhiwu; SUN, Qianru; SONG, Li; ZHANG, Wenjun
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects: Databases and Information Systems; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/8057
https://ink.library.smu.edu.sg/context/sis_research/article/9060/viewcontent/Xue_Freestyle_Layout_to_Image_Synthesis_CVPR_2023_paper.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9060
record_format dspace
spelling sg-smu-ink.sis_research-9060 2023-09-07T08:06:59Z 2023-06-01T07:00:00Z text application/pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
topic Databases and Information Systems
Graphics and Human Computer Interfaces
description Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., for image classification and object detection) trained on limited base classes have gained the ability to predict unseen classes. Inspired by this, we leverage large-scale pre-trained text-to-image diffusion models to generate unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout that very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to force each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific).
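The idea of restricting each text token to the pixels of its own region can be illustrated with a small sketch. This is not the authors' implementation: it assumes the rectification amounts to setting the attention logit of every (pixel, token) pair outside the token's region to negative infinity before the softmax over text tokens, and the function name `rectified_cross_attention`, the list-based tensors, and the toy "unicorn"/"bench" tokens are all hypothetical.

```python
import math


def rectified_cross_attention(queries, keys, values, region_mask):
    """Toy masked cross-attention over plain Python lists (illustrative only).

    queries:     P x d list, one feature vector per pixel (image token)
    keys:        T x d list, one feature vector per text token
    values:      T x v list, one value vector per text token
    region_mask: P x T list of bools; region_mask[p][t] is True when
                 text token t is allowed to act on pixel p (assumed to be
                 derived from the semantic layout)
    Returns (outputs, weights): P x v attended features, P x T attention.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q, allowed in zip(queries, region_mask):
        if not any(allowed):
            raise ValueError("every pixel needs at least one allowed text token")
        # Scaled dot-product logits; tokens outside their region get -inf.
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) if ok else float("-inf")
            for k, ok in zip(keys, allowed)
        ]
        # Numerically stable softmax over text tokens; exp(-inf) is 0.0.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of the text-token value vectors for this pixel.
        outputs.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights


# Four pixels and two hypothetical text tokens ("unicorn", "bench").
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
# Pixels 0-1 belong to the "unicorn" region, pixels 2-3 to the "bench" region.
region_mask = [[True, False], [True, False], [False, True], [False, True]]

outputs, weights = rectified_cross_attention(queries, keys, values, region_mask)
print(weights)
```

In this toy layout every pixel has exactly one allowed token, so each pixel receives only that token's value vector regardless of how similar its query is to the other token. A token allowed on every pixel (for example a global style word) would simply have an all-True column in `region_mask`.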
_version_ 1779157093946753024