Learning decoupled models for cross-modal generation

Full description
Cross-modal generation plays an important role in translating information between data modalities such as image, video, and text. Two representative tasks under the cross-modal generation umbrella are visual-to-text generation and text-to-visual generation. For visual-to-text generation, most existing methods adopt a pretrained object detection model to extract image object features, from which textual descriptions are generated. However, the pretrained detector does not always produce correct results on data from other domains, so the generated captions may fail to faithfully describe all of the visual content. For text-to-visual generation, the traditional approach is a text-conditioned Generative Adversarial Network (GAN), in which image generation training and cross-modal similarity learning are coupled; this coupling can reduce image generation quality and diversity. This thesis focuses on two main research questions. First, in the visual-to-text generation task, how can we learn decoupled models for food image and complex video datasets, which contain mixed ingredients and domain-specific object classes that are not covered during object detection pretraining? Second, in the text-to-visual generation task, how can we decouple image generation training from cross-modal similarity learning, so that text-guided image generation and manipulation can be conducted in the same framework with higher generation quality? To tackle these questions, we propose learning decoupled models for cross-modal generation tasks. Compared with the commonly used coupled architectures, decoupling the model components enables each of them to be learned effectively, so that the source modality can be translated to the target modality more easily.

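The coupled-versus-decoupled distinction in the abstract can be pictured with a small code sketch. The snippet below is an illustrative assumption, not the architecture proposed in the thesis: a text encoder that would be trained separately with a cross-modal similarity objective is frozen, and only the text-conditioned generator is updated during image generation training, so the two learning problems no longer interfere. All class names, dimensions, and the toy data are hypothetical.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Stand-in for a text encoder pretrained with an image-text similarity objective."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))   # h: (1, batch, dim)
        return h.squeeze(0)                   # (batch, dim) sentence embedding

class Generator(nn.Module):
    """Maps noise plus a fixed text embedding to a flattened image."""
    def __init__(self, z_dim=64, text_dim=128, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, img_pixels), nn.Tanh(),
        )

    def forward(self, z, text_emb):
        return self.net(torch.cat([z, text_emb], dim=1))

# Stage 1 (assumed to happen elsewhere): train TextEncoder with a contrastive
# image-text similarity loss. Here it is simply frozen.
text_encoder = TextEncoder()
for p in text_encoder.parameters():
    p.requires_grad_(False)

# Stage 2: only the generator is optimized (a full GAN would also train a
# discriminator); the similarity model no longer changes during this stage.
generator = Generator()
optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4)

tokens = torch.randint(0, 1000, (4, 12))      # dummy caption token ids
z = torch.randn(4, 64)                        # noise vectors
with torch.no_grad():
    cond = text_encoder(tokens)               # fixed conditioning vectors
fake = generator(z, cond)
print(fake.shape)                             # torch.Size([4, 12288])
```

Because the conditioning vectors come from a frozen module, updating the generator for generation or manipulation does not disturb the learned cross-modal similarity, which is one way to read the benefit the abstract attributes to decoupling.
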
Bibliographic Details
Main Author: Wang, Hao
Other Authors: Miao Chun Yan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering
Online Access:https://hdl.handle.net/10356/169609
Institution: Nanyang Technological University
School: School of Computer Science and Engineering
Contact: ASCYMiao@ntu.edu.sg (Miao Chun Yan)
Degree: Doctor of Philosophy
DOI: 10.32657/10356/169609
Citation: Wang, H. (2023). Learning decoupled models for cross-modal generation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169609
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Collection: DR-NTU (NTU Library), Nanyang Technological University, Singapore