UniD3: unified discrete diffusion for simultaneous vision-language generation

Bibliographic Details
Main Authors: Hu, Minghui, Zheng, Chuanxia, Cham, Tat-Jen, Suganthan, Ponnuthurai Nagaratnam, Yang, Zuopeng, Zheng, Heliang, Wang, Chaoyue, Tao, Dacheng
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2023
Online Access:https://hdl.handle.net/10356/172665
https://openreview.net/forum?id=8JqINxA-2a
Institution: Nanyang Technological University
Description
Summary: Recently developed discrete diffusion models perform extraordinarily well on generation tasks, especially text-to-image generation, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model that can perform text-based, image-based, and even simultaneous vision-language generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions across various generation tasks.
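
The abstract does not spell out the unified Markov transition matrix, but a mask-and-replace construction in the style of VQ-Diffusion extends naturally to two modalities. The Python sketch below is an illustration under that assumption: the vocabulary sizes, the probabilities alpha/beta/gamma, and the function name unified_transition_matrix are hypothetical, not the paper's definition. It builds a block-structured matrix in which text and image tokens corrupt only within their own vocabulary block and share a single absorbing [MASK] state.

import numpy as np

# Illustrative sizes only; the real model would use large text and
# VQ-image codebook vocabularies. This is a sketch, not the paper's matrix.
TEXT_V = 4                 # text token ids 0..3
IMAGE_V = 4                # image (VQ codebook) token ids 4..7
MASK = TEXT_V + IMAGE_V    # shared [MASK] absorbing state, id 8
K = TEXT_V + IMAGE_V + 1   # total number of discrete states

def unified_transition_matrix(alpha, beta, gamma):
    """One-step transition matrix Q_t over the joint text+image vocabulary.

    A token keeps its value with probability alpha, is resampled uniformly
    within its own modality block with total probability beta, and is
    absorbed into the shared [MASK] state with probability gamma, where
    alpha + beta + gamma = 1. Rows index the current state, columns the
    next state; [MASK] is absorbing.
    """
    Q = np.zeros((K, K))
    for start, size in [(0, TEXT_V), (TEXT_V, IMAGE_V)]:
        for i in range(start, start + size):
            Q[i, start:start + size] = beta / size  # uniform within modality
            Q[i, i] += alpha                        # keep current token
            Q[i, MASK] = gamma                      # absorb into [MASK]
    Q[MASK, MASK] = 1.0                             # [MASK] never escapes
    return Q

Q = unified_transition_matrix(alpha=0.9, beta=0.05, gamma=0.05)
assert np.allclose(Q.sum(axis=1), 1.0)  # every row is a valid distribution

Because the two modality blocks never exchange probability mass directly, a single forward process can corrupt a concatenated text-image token sequence while respecting each modality's vocabulary; in this reading, cross-modal coupling is left to the denoising network (e.g., the mutual attention module the abstract mentions) rather than to the noising matrix itself.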