Bridging images and natural language with deep learning
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between images and natural language. Experimental results on public vision-and-language datasets show that all of these methods obtain significant performance improvements on vision-and-language tasks such as image captioning and cross-modal retrieval.
| Main Author: | Gu, Jiuxiang |
|---|---|
| Other Authors: | Cai Jianfei |
| Format: | Theses and Dissertations |
| Language: | English |
| Published: | 2019 |
| Subjects: | Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision |
| Online Access: | https://hdl.handle.net/10356/85399 http://hdl.handle.net/10220/50454 |
| Institution: | Nanyang Technological University |
Description:

We, as humans, can easily use our vision and language capabilities to accomplish a wide variety of tasks that combine the image and text modalities. These tasks are much harder for machines, because they require a model to understand both the image and the language, and in particular how the two relate to each other. In recent years, considerable progress has been made in applying deep learning to computer vision and natural language processing, but connecting images with natural language remains challenging because the two modalities differ in structure and characteristics.

In this thesis, I seek to bridge images and natural language with deep learning. Five methods are proposed to reduce the gap between the image and text modalities: a convolutional neural network-based language model for image captioning, coarse-to-fine learning for image captioning, visual-textual cross-modal retrieval with generative models, unpaired image captioning by language pivoting, and unpaired image captioning via scene graph alignments. The major contributions of this thesis are as follows:

• A convolutional neural network-based language model suitable for statistical language modeling tasks. The model is fed with all the previous words, whereas earlier recurrent neural network-based language models predict the next word from a single previous word and a hidden state. Its ability to model the hierarchical structure and long-term dependencies among words is critical for image captioning.

• A coarse-to-fine multi-stage prediction framework for image captioning. The framework is composed of multiple decoders, each of which operates on the output of the previous stage and produces an increasingly refined image description. In particular, I optimize the model with a reinforcement learning approach that uses the output of each intermediate decoder's test-time inference algorithm, together with the output of its preceding decoder, to normalize the rewards.

• A visual-textual cross-modal retrieval method with generative learning. Unlike existing cross-modal retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, I incorporate two generative processes (image-to-text and text-to-image) into the cross-modal feature embedding, through which the model learns not only global abstract features but also local grounded features.

• An unpaired image captioning method based on language pivoting. A pivot language serves as an intermediary between an input image and a caption in the target language: the method captures the characteristics of an image captioner in the pivot language and aligns them to the target language using a pivot-target sentence parallel corpus. An autoencoder trained in the target language guides the target decoder to produce caption-like sentences.

• An unpaired image captioning method based on scene graph alignments. The framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. I first train the scene graph encoder and the sentence decoder on the text modality; to align the scene graphs between images and sentences, I then propose an unsupervised feature alignment method that maps scene graph features from the image modality to the sentence modality without any paired training data.

Throughout the thesis, I demonstrate how each of the proposed methods bridges the gap between images and natural language. Experimental results on public vision-and-language datasets show that all of these methods obtain significant performance improvements on vision-and-language tasks such as image captioning and cross-modal retrieval. Hedged, illustrative code sketches of the five methods are appended after the thesis details below.

Thesis details:

| Degree: | Doctor of Philosophy |
|---|---|
| Supervisor: | Cai Jianfei |
| School: | Interdisciplinary Graduate School (IGS), Research Techno Plaza |
| Citation: | Gu, J. (2019). Bridging images and natural language with deep learning. Doctoral thesis, Nanyang Technological University, Singapore. |
| DOI: | 10.32657/10356/85399 |
| Extent: | 174 p., application/pdf |
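Sketch 1: CNN-based language modeling. The first contribution conditions next-word prediction on all previous words through causal convolutions rather than a recurrent hidden state. Below is a minimal PyTorch sketch of that idea; the class name, layer count, and dimensions are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class CNNLanguageModel(nn.Module):
    """Predict each next word from ALL previous words via stacked
    left-padded (causal) 1-D convolutions, instead of a recurrent state."""
    def __init__(self, vocab_size, embed_dim=128, num_layers=4, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, embed_dim, kernel_size) for _ in range(num_layers)
        )
        self.left_pad = kernel_size - 1   # position t never sees t+1, t+2, ...
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # -> (batch, dim, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (self.left_pad, 0))))
        return self.out(x.transpose(1, 2))        # -> (batch, seq_len, vocab)

# Usage: logits[:, t] scores the word at position t+1 given words 0..t.
logits = CNNLanguageModel(vocab_size=10000)(torch.randint(0, 10000, (2, 16)))
```

Stacking layers widens the receptive field, which is how such a model can capture the hierarchical, long-range word structure the abstract refers to.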
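Sketch 2: reward normalization in coarse-to-fine decoding. Each refinement stage is trained with policy gradients, with the reward earned by the preceding decoder's output acting as a baseline. The function below is a hedged sketch of that single step; the reward values are assumed to come from a sentence-level metric such as CIDEr, and all names are hypothetical.

```python
import torch

def refinement_stage_loss(log_prob_sum: torch.Tensor,
                          stage_reward: float,
                          prev_stage_reward: float) -> torch.Tensor:
    """REINFORCE-style loss for one decoder stage. Subtracting the
    preceding stage's reward normalizes the signal, so a stage is only
    rewarded for improving on the caption it was given to refine."""
    advantage = stage_reward - prev_stage_reward
    return -advantage * log_prob_sum   # log_prob_sum: sum of word log-probs

# A full training step would sum this term over all refinement stages.
```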
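Sketch 3: generative cross-modal embedding. The retrieval objective can be read as a standard bidirectional ranking loss on the shared embedding plus two generative terms (caption likelihood for image-to-text, reconstruction error for text-to-image). The hinge ranking loss below is a common formulation, and the combination weights are hypothetical, not values from the thesis.

```python
import torch

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge loss over a batch of matched image-text pairs
    embedded in a common space (rows of img_emb and txt_emb correspond)."""
    scores = img_emb @ txt_emb.t()               # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)              # matched-pair similarities
    cost_txt = (margin + scores - pos).clamp(min=0)       # image vs. wrong texts
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # text vs. wrong images
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt[off_diag].sum() + cost_img[off_diag].sum()

def total_objective(img_emb, txt_emb, i2t_nll, t2i_rec_err, w1=1.0, w2=1.0):
    """Embedding loss plus the two generative losses; w1 and w2 are
    hypothetical trade-off weights."""
    return ranking_loss(img_emb, txt_emb) + w1 * i2t_nll + w2 * t2i_rec_err
```

The generative terms push the embedding to retain locally grounded detail that a pure ranking loss is free to discard.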
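Sketch 4: language pivoting at inference time. Unpaired captioning by pivoting composes two separately trained models, so no image-target pairs are ever needed. The callables below are hypothetical stand-ins for the trained components.

```python
def caption_via_pivot(image, image_to_pivot, pivot_to_target):
    """image_to_pivot is trained on image-pivot caption pairs and
    pivot_to_target on a pivot-target parallel corpus. During training,
    a target-language autoencoder additionally guides the target decoder
    toward caption-like sentences."""
    pivot_caption = image_to_pivot(image)     # caption in the pivot language
    return pivot_to_target(pivot_caption)     # caption in the target language
```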
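Sketch 5: unsupervised scene-graph feature alignment. The abstract states that image-side scene-graph features are mapped into the sentence-side feature space without paired data, but does not specify the mechanism; the adversarial mapper below is one plausible, explicitly hypothetical realization.

```python
import torch.nn as nn

class GraphFeatureMapper(nn.Module):
    """Maps image scene-graph features into the sentence scene-graph
    feature space. Trained adversarially (an assumption) against the
    discriminator below, so no image-sentence pairs are required."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, img_graph_feats):
        return self.net(img_graph_feats)

# Training alternates: the discriminator learns to tell mapped image
# features from real sentence-graph features, and the mapper learns to
# fool it, pulling the two feature distributions together.
discriminator = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
```

Once the features are aligned, the sentence decoder trained on text alone can decode captions directly from mapped image-graph features.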