Learning decoupled models for cross-modal generation

Bibliographic Details
Main Author: Wang, Hao
Other Authors: Miao Chun Yan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/169609
Institution: Nanyang Technological University
Description
Summary: Cross-modal generation plays an important role in translating information between different data modalities, such as images, videos and text. Two representative tasks under the cross-modal generation umbrella are visual-to-text generation and text-to-visual generation. For the visual-to-text generation task, most existing methods adopt a pretrained object detection model to extract image object features, from which they generate textual descriptions. However, the pretrained model cannot always produce correct results on data from other domains, so the generated captions may fail to faithfully represent all the visual content. For the text-to-visual generation task, the traditional approach uses a text-conditioned Generative Adversarial Network (GAN) architecture to generate images, in which image generation training and cross-modal similarity learning are coupled. This coupling may reduce image generation quality and diversity. In this thesis, we focus on two main research questions. First, in the visual-to-text generation task, how can we learn decoupled models for food image and complex video datasets, which contain mixed ingredients and domain-specific object classes that are not covered during the object detection model's pretraining? Second, in the text-to-visual generation task, how can we decouple image generation training from cross-modal similarity learning, so that text-guided image generation and manipulation can be conducted in the same framework with improved generation quality? To tackle these research questions, we propose learning decoupled models for the cross-modal generation tasks. Compared with commonly used coupled model architectures, decoupling the model components enables each of them to be learned effectively, so that the source modality can be translated to the target modality more easily.
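
The second research question hinges on separating image generation training from cross-modal similarity learning. The sketch below is a minimal, illustrative PyTorch example of what such a decoupled scheme could look like in general; it is not the thesis implementation. All model classes, dimensions, and the synthetic data are placeholders introduced here for illustration: a toy text/image encoder pair is first trained with a contrastive similarity objective, then frozen and used to guide a separately trained conditional generator, instead of learning the similarity model jointly inside one adversarial loop.

```python
# Minimal sketch (not the thesis code) contrasting coupled vs. decoupled training.
# Models, dimensions, and data are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Toy text encoder: token indices -> embedding."""
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)
    def forward(self, tokens):
        return self.emb(tokens)

class ImageEncoder(nn.Module):
    """Toy image encoder: 3x32x32 image -> embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    def forward(self, images):
        return self.net(images)

class Generator(nn.Module):
    """Toy conditional generator: noise + text embedding -> 3x32x32 image."""
    def __init__(self, dim=128, noise=64):
        super().__init__()
        self.net = nn.Linear(noise + dim, 3 * 32 * 32)
    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1)).view(-1, 3, 32, 32)

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE-style cross-modal similarity loss."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / tau
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# --- Stage 1 (decoupled): learn cross-modal similarity on its own. ---
txt_enc, img_enc = TextEncoder(), ImageEncoder()
sim_opt = torch.optim.Adam(list(txt_enc.parameters()) + list(img_enc.parameters()), lr=1e-3)
tokens = torch.randint(0, 1000, (8, 16))   # synthetic caption tokens
images = torch.randn(8, 3, 32, 32)         # synthetic paired images
loss = contrastive_loss(img_enc(images), txt_enc(tokens))
sim_opt.zero_grad(); loss.backward(); sim_opt.step()

# --- Stage 2 (decoupled): train the generator with the encoders frozen. ---
for p in txt_enc.parameters(): p.requires_grad_(False)
for p in img_enc.parameters(): p.requires_grad_(False)
gen = Generator()
gen_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
z = torch.randn(8, 64)
fake = gen(z, txt_enc(tokens))
# The text-image matching score from the frozen encoders guides generation;
# in a coupled text-conditioned GAN this similarity model would instead be
# optimized jointly with the generator and discriminator.
guide = -F.cosine_similarity(img_enc(fake), txt_enc(tokens)).mean()
gen_opt.zero_grad(); guide.backward(); gen_opt.step()
```

Because the similarity model is trained and frozen separately in this sketch, the same frozen encoders could score either newly generated or manipulated images, which is the intuition behind handling text-guided generation and manipulation in one framework.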