Learning decoupled models for cross-modal generation
Cross-modal generation plays an important role in translating information between data modalities such as image, video, and text. Two representative tasks under the cross-modal generation umbrella are visual-to-text generation and text-to-visual generation. For visual-to-text generation, most existing methods adopt a pretrained object detection model to extract image object features, from which they generate textual descriptions. However, the pretrained model cannot always produce correct results on data from other domains, so the generated captions may fail to faithfully describe all of the visual content. For text-to-visual generation, the traditional approach is a text-conditioned Generative Adversarial Network (GAN) architecture, in which image generation training and cross-modal similarity learning are coupled; this can reduce image generation quality and diversity. This thesis focuses on two main research questions. First, in visual-to-text generation, how can we learn decoupled models for food image and complex video datasets, which contain mixed ingredients and domain-specific object classes not covered during object detection pretraining? Second, in text-to-visual generation, how can we decouple image generation training from cross-modal similarity learning, so that text-guided image generation and manipulation can be performed in the same framework with improved generation quality? To address these questions, we propose learning decoupled models for cross-modal generation tasks. Compared with commonly coupled model architectures, decoupling the model components enables each of them to be learned effectively, so that the source modality can be translated to the target modality more easily.
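The second research question, decoupling image generation training from cross-modal similarity learning, can be illustrated with a minimal sketch: a frozen unconditional generator and a separately trained text-image similarity model, with text guidance applied only by optimising the generator's latent code. This is one generic way to realise the decoupling idea, not the architecture actually used in the thesis; all module names, dimensions, and the toy encoders below are illustrative placeholders.

```python
# Illustrative sketch of decoupled text-to-image generation/manipulation:
# the generator and the cross-modal similarity model are trained (or loaded)
# independently, and text guidance is applied only at inference time by
# optimising the latent code. Everything here is a toy stand-in.
import torch
import torch.nn as nn

LATENT_DIM, EMBED_DIM, IMG_PIXELS = 128, 64, 3 * 32 * 32

class ToyGenerator(nn.Module):
    """Stand-in for a pretrained *unconditional* image generator G(z)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, IMG_PIXELS), nn.Tanh())
    def forward(self, z):
        return self.net(z)

class ToySimilarity(nn.Module):
    """Stand-in for a separately trained text-image similarity model
    (e.g. a CLIP-style dual encoder)."""
    def __init__(self):
        super().__init__()
        self.image_enc = nn.Linear(IMG_PIXELS, EMBED_DIM)
        self.text_enc = nn.Embedding(1000, EMBED_DIM)  # toy "text" = token ids
    def forward(self, image, text_ids):
        img = nn.functional.normalize(self.image_enc(image), dim=-1)
        txt = nn.functional.normalize(self.text_enc(text_ids).mean(dim=1), dim=-1)
        return (img * txt).sum(dim=-1)  # cosine similarity

generator, similarity = ToyGenerator().eval(), ToySimilarity().eval()
for p in list(generator.parameters()) + list(similarity.parameters()):
    p.requires_grad_(False)  # both models stay frozen: the text-matching
                             # objective never alters the generator's training

text_ids = torch.randint(0, 1000, (1, 8))           # a (toy) text prompt
z = torch.randn(1, LATENT_DIM, requires_grad=True)  # latent code to optimise
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    image = generator(z)
    loss = -similarity(image, text_ids).mean()  # pull the image toward the text
    loss.backward()
    optimizer.step()

print("final text-image similarity:", -loss.item())
```

Because neither model's weights are updated by the text-matching objective, generation quality and diversity remain those of the generator itself, and the same frozen pair supports manipulation as well as generation, e.g. by initialising the latent code from an existing image's latent instead of random noise.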
| Main Author: | Wang, Hao |
| --- | --- |
| Other Authors: | Miao Chun Yan |
| School: | School of Computer Science and Engineering |
| Format: | Thesis-Doctor of Philosophy |
| Language: | English |
| Published: | Nanyang Technological University, 2023 |
| Subjects: | Engineering::Computer science and engineering |
| Online Access: | https://hdl.handle.net/10356/169609 |
| DOI: | 10.32657/10356/169609 |
| Rights: | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). |
| Citation: | Wang, H. (2023). Learning decoupled models for cross-modal generation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169609 |
| Institution: | Nanyang Technological University |