UniD3: unified discrete diffusion for simultaneous vision-language generation
The recently developed discrete diffusion model performs extraordinarily well in generation tasks, especially in the text-to-image task, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model, which can perform text-based, image-based, and even vision-language simultaneous generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
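The abstract centers on a single Markov transition matrix shared across modalities, but this record contains none of the paper's equations. The snippet below is therefore only a minimal toy sketch, assuming a mask-and-uniform corruption over a joint vocabulary of image codes and text tokens with one shared [MASK] absorbing state; the function name `unified_transition_matrix` and the parameters `alpha`/`beta` are illustrative and not taken from the paper.

```python
import numpy as np

# Hedged toy sketch (not the paper's exact construction): one Markov
# transition matrix over the concatenated image-token and text-token
# vocabularies plus a single shared [MASK] state. At each step a token
# keeps its value with probability alpha, moves uniformly within ITS OWN
# modality with probability beta, and is absorbed into [MASK] with
# probability gamma = 1 - alpha - beta.

def unified_transition_matrix(n_image, n_text, alpha, beta):
    n = n_image + n_text + 1          # +1 for the shared [MASK] state
    gamma = 1.0 - alpha - beta
    Q = np.zeros((n, n))

    # image block: uniform noise stays among image tokens
    Q[:n_image, :n_image] = beta / n_image
    # text block: uniform noise stays among text tokens
    Q[n_image:n_image + n_text, n_image:n_image + n_text] = beta / n_text
    # keep-same-token probability on the diagonal
    idx = np.arange(n_image + n_text)
    Q[idx, idx] += alpha
    # absorption into the shared [MASK] column
    Q[:n_image + n_text, -1] = gamma
    # [MASK] is absorbing
    Q[-1, -1] = 1.0
    return Q

Q = unified_transition_matrix(n_image=8, n_text=4, alpha=0.9, beta=0.05)
assert np.allclose(Q.sum(axis=1), 1.0)   # every row is a valid distribution
```

In this toy construction the cross-modal blocks are zero, so uniform noise never turns an image code into a text token; only the shared [MASK] state ties the two modalities together during corruption. The paper's actual transition matrix, noise schedule, and unified objective may differ.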
Main Authors: | Hu, Minghui; Zheng, Chuanxia; Cham, Tat-Jen; Suganthan, Ponnuthurai Nagaratnam; Yang, Zuopeng; Zheng, Heliang; Wang, Chaoyue; Tao, Dacheng |
---|---|
Other Authors: | School of Computer Science and Engineering |
Format: | Conference or Workshop Item |
Language: | English |
Published: | 2023 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Diffusion; Computer Graphics |
Online Access: | https://hdl.handle.net/10356/172665 https://openreview.net/forum?id=8JqINxA-2a |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-172665 |
record_format |
dspace |
spelling |
sg-ntu-dr.10356-172665 2023-12-22T15:36:34Z
UniD3: unified discrete diffusion for simultaneous vision-language generation
Hu, Minghui; Zheng, Chuanxia; Cham, Tat-Jen; Suganthan, Ponnuthurai Nagaratnam; Yang, Zuopeng; Zheng, Heliang; Wang, Chaoyue; Tao, Dacheng
School of Computer Science and Engineering
2023 International Conference on Learning Representations (ICLR)
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Diffusion; Computer Graphics
The recently developed discrete diffusion model performs extraordinarily well in generation tasks, especially in the text-to-image task, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model, which can perform text-based, image-based, and even vision-language simultaneous generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
Published version 2023-12-21T06:32:34Z 2023-12-21T06:32:34Z 2023 Conference Paper
Hu, M., Zheng, C., Cham, T., Suganthan, P. N., Yang, Z., Zheng, H., Wang, C. & Tao, D. (2023). UniD3: unified discrete diffusion for simultaneous vision-language generation. 2023 International Conference on Learning Representations (ICLR), 1-23.
https://hdl.handle.net/10356/172665 https://openreview.net/forum?id=8JqINxA-2a
1-23 en
© 2023 The Author(s). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at https://openreview.net/forum?id=8JqINxA-2a.
application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Diffusion; Computer Graphics |
description |
The recently developed discrete diffusion model performs extraordinarily well in generation tasks, especially in the text-to-image task, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model, which can perform text-based, image-based, and even vision-language simultaneous generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks. |
author2 |
School of Computer Science and Engineering |
format |
Conference or Workshop Item |
author |
Hu, Minghui; Zheng, Chuanxia; Cham, Tat-Jen; Suganthan, Ponnuthurai Nagaratnam; Yang, Zuopeng; Zheng, Heliang; Wang, Chaoyue; Tao, Dacheng |
author_sort |
Hu, Minghui |
title |
UniD3: unified discrete diffusion for simultaneous vision-language generation |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/172665 https://openreview.net/forum?id=8JqINxA-2a |
_version_ |
1787136561595285504 |