Learning decoupled models for cross-modal generation
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/169609 |
Institution: | Nanyang Technological University |
Summary: | Cross-modal generation plays an important role in translating information between data modalities such as image, video and text. Two representative tasks under the cross-modal generation umbrella are visual-to-text generation and text-to-visual generation. For visual-to-text generation, most existing methods adopt a pretrained object detection model to extract image object features, from which textual descriptions are generated. However, the pretrained model cannot always produce correct results on data from other domains, so the generated captions may fail to faithfully describe all the visual content. For text-to-visual generation, the traditional approach is a text-conditioned Generative Adversarial Network (GAN) architecture, in which image generation training and cross-modal similarity learning are coupled; this coupling can reduce image generation quality and diversity. In this thesis, we focus on two main research questions. First, in the visual-to-text generation task, how can we learn decoupled models for food image and complex video datasets, which contain mixed ingredients and domain-specific object classes that are not covered during object detection pretraining? Second, in the text-to-visual generation task, how can we decouple image generation training from cross-modal similarity learning, so that text-guided image generation and manipulation can be carried out in the same framework with improved generation quality? To tackle these research questions, we propose learning decoupled models for cross-modal generation tasks. Compared with commonly used coupled architectures, decoupling the model components allows each of them to be learned effectively, so that the source modality can be translated into the target modality more easily. |
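To make the decoupling idea in the summary concrete, the sketch below separates image generation training (stage 1, images only) from cross-modal similarity learning (stage 2, text mapped into a frozen image latent space). This is a minimal illustrative sketch under assumed placeholders, not the thesis's actual architecture: the module names (`ImageAutoencoder`, `TextEncoder`), the latent size, the losses, and the manipulation rule are all hypothetical.

```python
# Hypothetical two-stage, decoupled text-to-image sketch (not the thesis's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 128  # assumed latent size for illustration

class ImageAutoencoder(nn.Module):
    """Stage 1: learn to reconstruct images with no text supervision at all."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 3 * 64 * 64), nn.Tanh())

    def forward(self, images):
        z = self.encoder(images)
        recon = self.decoder(z).view(-1, 3, 64, 64)
        return z, recon

class TextEncoder(nn.Module):
    """Stage 2: map captions into the *frozen* image latent space."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.proj = nn.Linear(64, LATENT_DIM)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids).mean(dim=1))  # mean-pool token embeddings

# ---- Stage 1: image generation training (no text involved) ----
autoenc = ImageAutoencoder()
opt_img = torch.optim.Adam(autoenc.parameters(), lr=1e-3)
images = torch.rand(8, 3, 64, 64) * 2 - 1            # dummy image batch in [-1, 1]
_, recon = autoenc(images)
opt_img.zero_grad()
F.mse_loss(recon, images).backward()                  # reconstruction-only objective
opt_img.step()

# ---- Stage 2: cross-modal similarity learning against the frozen image model ----
for p in autoenc.parameters():
    p.requires_grad_(False)                           # generation model stays fixed
text_enc = TextEncoder()
opt_txt = torch.optim.Adam(text_enc.parameters(), lr=1e-3)
captions = torch.randint(0, 1000, (8, 12))            # dummy paired captions
opt_txt.zero_grad()
(1 - F.cosine_similarity(text_enc(captions), autoenc.encoder(images)).mean()).backward()
opt_txt.step()

# ---- Inference: generation and manipulation share the same frozen generator ----
with torch.no_grad():
    generated = autoenc.decoder(text_enc(captions)).view(-1, 3, 64, 64)
    z_src = autoenc.encoder(images)                   # manipulation: nudge an existing
    edited = autoenc.decoder(                         # image's latent toward the text latent
        0.7 * z_src + 0.3 * text_enc(captions)).view(-1, 3, 64, 64)
```

Because the image model is trained and frozen before any text is introduced, text-guided generation and text-guided manipulation both reuse the same generator, which is the sense in which image generation training and cross-modal similarity learning are decoupled in the summary above.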