Masked diffusion transformer is a strong image synthesizer

Despite its success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among the semantic parts of objects in an image. During training, MDT operates in the latent space and masks certain tokens; an asymmetric masking diffusion transformer is then designed to predict the masked tokens from the unmasked ones while maintaining the diffusion generation process. MDT can thus reconstruct the full information of an image from incomplete contextual input, enabling it to learn the associated relations among image tokens. Experimental results show that MDT achieves superior image synthesis performance, e.g., a new SOTA FID score on the ImageNet dataset, and learns about 3× faster than the previous SOTA, DiT. The source code is released at https://github.com/sail-sg/MDT.
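
The mask latent modeling scheme described in the abstract can be illustrated with a short sketch. The PyTorch-style code below is a minimal illustration of the idea only, not the authors' implementation (that is released at https://github.com/sail-sg/MDT); the module sizes, the mask ratio, and all names such as MaskedDiffusionSketch are assumptions made for this example.

# Illustrative sketch of the mask latent modeling idea from the abstract.
# NOT the authors' implementation (see https://github.com/sail-sg/MDT);
# all names, shapes, depths, and the mask ratio here are assumptions.
import torch
import torch.nn as nn

class MaskedDiffusionSketch(nn.Module):
    def __init__(self, num_tokens=256, dim=384, depth_enc=4, depth_dec=2, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        # Asymmetric design: a deeper encoder sees only the unmasked tokens,
        # a lighter decoder predicts the full noised token sequence.
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth_enc)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, depth_dec)

    def forward(self, noised_latents):  # (B, N, D) noised latent tokens
        B, N, D = noised_latents.shape
        x = noised_latents + self.pos
        # Randomly keep a subset of tokens; the rest are masked out.
        num_keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_ids = ids[:, :num_keep]
        visible = torch.gather(x, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
        # Encode only the visible (unmasked) tokens.
        enc = self.encoder(visible)
        # Scatter the encoded tokens back, fill masked slots with a learned
        # token, and let the decoder predict the full sequence so the usual
        # diffusion target (e.g., noise prediction) is preserved.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep_ids.unsqueeze(-1).expand(-1, -1, D), enc)
        return self.decoder(full + self.pos)

A toy forward pass would be out = MaskedDiffusionSketch()(torch.randn(2, 256, 384)). The asymmetry (deep encoder over visible tokens, light decoder over the full sequence) mirrors the abstract's point that masked tokens are predicted from unmasked ones while the diffusion generation process is maintained.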

Bibliographic Details
Main Authors: GAO, Shanghua; ZHOU, Pan; CHENG, Ming-Ming; YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects: Training; Representation learning; Image synthesis; Computational modeling; Synthesizers; Source coding; Semantics; Graphics and Human Computer Interfaces
DOI: 10.1109/ICCV51070.2023.02117
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School of Computing and Information Systems, InK@SMU
Online Access: https://ink.library.smu.edu.sg/sis_research/9024
https://ink.library.smu.edu.sg/context/sis_research/article/10027/viewcontent/2023_ICCV_MDT.pdf