UniD3: unified discrete diffusion for simultaneous vision-language generation

Recently developed discrete diffusion models perform extraordinarily well in generation tasks, especially text-to-image generation, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model that can perform text-based, image-based, and even simultaneous vision-language generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight inter-modal linkages, which are vital for multimodal generation. Extensive experiments indicate that the proposed method performs comparably to state-of-the-art solutions on various generation tasks.
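
As context for the abstract's "unified Markov transition matrix": the record itself contains no method details, so the snippet below is only a minimal, hypothetical sketch of the absorbing-state ([MASK]) transition matrices commonly used in discrete diffusion (VQ-Diffusion-style), extended to a concatenated text+image token vocabulary with a shared mask state. The function name, the rates alpha_t and gamma_t, and the per-modality uniform blocks are illustrative assumptions, not the paper's exact construction.

```python
# Hypothetical sketch of a discrete-diffusion transition matrix over a
# unified text+image vocabulary; NOT the paper's exact formulation.
import numpy as np

def transition_matrix(K_text: int, K_image: int,
                      alpha_t: float, gamma_t: float) -> np.ndarray:
    """Build Q_t over K = K_text + K_image tokens plus one shared
    [MASK] state at index K. Row i is the distribution q(x_t | x_{t-1}=i):
    a non-mask token keeps its value with prob. alpha_t, jumps to [MASK]
    with prob. gamma_t, and otherwise moves uniformly within its own
    modality's sub-vocabulary; [MASK] is absorbing."""
    K = K_text + K_image
    Q = np.zeros((K + 1, K + 1))
    beta_text = (1.0 - alpha_t - gamma_t) / K_text
    beta_image = (1.0 - alpha_t - gamma_t) / K_image
    # Uniform diffusion restricted to each modality's block.
    Q[:K_text, :K_text] = beta_text
    Q[K_text:K, K_text:K] = beta_image
    # Self-transition mass and transition to the shared mask state.
    idx = np.arange(K)
    Q[idx, idx] += alpha_t
    Q[idx, K] = gamma_t
    Q[K, K] = 1.0  # [MASK] is absorbing
    return Q

# One forward step: q(x_t | x_{t-1}) = Categorical(Q_t[x_{t-1}]).
Q = transition_matrix(K_text=8, K_image=16, alpha_t=0.9, gamma_t=0.05)
assert np.allclose(Q.sum(axis=1), 1.0)  # every row is a valid distribution
```

A shared mask state across both sub-vocabularies is one natural way a single transition matrix could couple the two modalities; the actual schedule and coupling in UniD3 may differ.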


Bibliographic Details
Main Authors: Hu, Minghui, Zheng, Chuanxia, Cham, Tat-Jen, Suganthan, Ponnuthurai Nagaratnam, Yang, Zuopeng, Zheng, Heliang, Wang, Chaoyue, Tao, Dacheng
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Diffusion; Computer Graphics
Online Access:https://hdl.handle.net/10356/172665
https://openreview.net/forum?id=8JqINxA-2a
Institution: Nanyang Technological University
Conference: 2023 International Conference on Learning Representations (ICLR)
Citation: Hu, M., Zheng, C., Cham, T., Suganthan, P. N., Yang, Z., Zheng, H., Wang, C. & Tao, D. (2023). UniD3: unified discrete diffusion for simultaneous vision-language generation. 2023 International Conference on Learning Representations (ICLR), 1-23.
Rights: © 2023 The Author(s). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at https://openreview.net/forum?id=8JqINxA-2a.