Bridging images and natural language with deep learning
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between images and natural language. Experimental results on public vision-and-language datasets show that all of these methods obtain significant performance improvements on vision-and-language tasks such as image captioning and cross-modal retrieval.
| Main Author: | Gu, Jiuxiang |
|---|---|
| Other Authors: | Cai Jianfei |
| Format: | Theses and Dissertations |
| Language: | English |
| Published: | 2019 |
| Subjects: | Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision |
| Online Access: | https://hdl.handle.net/10356/85399 http://hdl.handle.net/10220/50454 |
| Institution: | Nanyang Technological University |
Description:

We, as humans, can easily use our vision and language capabilities to accomplish a wide variety of tasks that combine the image and text modalities. These tasks are much harder for machines, because they require a model to understand both the image and the language, and in particular how the two relate to each other. In recent years, considerable progress has been made in applying deep learning to computer vision and natural language processing, but connecting images with natural language remains challenging because the two modalities differ in structure and characteristics.

In this thesis, I seek to bridge images and natural language with deep learning. Five methods are proposed to reduce the gap between the image and text modalities: a convolutional neural network-based language model for image captioning, coarse-to-fine learning for image captioning, visual-textual cross-modal retrieval with generative models, unpaired image captioning by language pivoting, and unpaired image captioning via scene graph alignments. The major contributions of this thesis are as follows:

• A convolutional neural network-based language model suitable for statistical language modeling tasks. The model is fed with all the previous words, whereas earlier recurrent neural network-based language models predict the next word from a single previous word and a hidden state. Its ability to model the hierarchical structure and long-term dependencies among words is critical for image captioning.

• A coarse-to-fine multi-stage prediction framework for image captioning. The framework is composed of multiple decoders, each of which operates on the output of the previous stage and produces an increasingly refined image description. In particular, I optimize the model with a reinforcement learning approach that uses the output of each intermediate decoder's test-time inference algorithm, together with the output of its preceding decoder, to normalize the rewards.

• A visual-textual cross-modal retrieval method with generative learning. Unlike existing cross-modal retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, I incorporate two generative processes (image-to-text and text-to-image) into the cross-modal feature embedding, through which the model learns not only global abstract features but also local grounded features.

• An unpaired image captioning method based on language pivoting. A pivot language serves as an intermediary between an input image and a caption in the target language: the method captures the characteristics of an image captioner in the pivot language and aligns them to the target language using a pivot-target sentence parallel corpus. An autoencoder trained in the target language guides the target decoder to produce caption-like sentences.

• An unpaired image captioning method based on scene graph alignments. The framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. I first train the scene graph encoder and the sentence decoder on the text modality; to align the scene graphs between images and sentences, I then propose an unsupervised feature alignment method that maps scene graph features from the image modality to the sentence modality without any paired training data.

Throughout the thesis, I demonstrate how each of the proposed methods bridges the gap between images and natural language. Experimental results on public vision-and-language datasets show that all of these methods obtain significant performance improvements on vision-and-language tasks such as image captioning and cross-modal retrieval. Hedged, illustrative code sketches of the five methods are appended after the thesis details below.

Thesis details:

| Degree: | Doctor of Philosophy |
|---|---|
| Supervisor: | Cai Jianfei |
| School: | Interdisciplinary Graduate School (IGS), Research Techno Plaza |
| Citation: | Gu, J. (2019). Bridging images and natural language with deep learning. Doctoral thesis, Nanyang Technological University, Singapore. |
| DOI: | 10.32657/10356/85399 |
| Extent: | 174 p., application/pdf |
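Sketch 1: CNN-based language modeling. The first contribution conditions next-word prediction on all previous words through causal convolutions rather than a recurrent hidden state. Below is a minimal PyTorch sketch of that idea; the class name, layer count, and dimensions are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class CNNLanguageModel(nn.Module):
    """Predict each next word from ALL previous words via stacked
    left-padded (causal) 1-D convolutions, instead of a recurrent state."""
    def __init__(self, vocab_size, embed_dim=128, num_layers=4, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, embed_dim, kernel_size) for _ in range(num_layers)
        )
        self.left_pad = kernel_size - 1   # position t never sees t+1, t+2, ...
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # -> (batch, dim, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (self.left_pad, 0))))
        return self.out(x.transpose(1, 2))        # -> (batch, seq_len, vocab)

# Usage: logits[:, t] scores the word at position t+1 given words 0..t.
logits = CNNLanguageModel(vocab_size=10000)(torch.randint(0, 10000, (2, 16)))
```

Stacking layers widens the receptive field, which is how such a model can capture the hierarchical, long-range word structure the abstract refers to.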
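Sketch 2: reward normalization in coarse-to-fine decoding. Each refinement stage is trained with policy gradients, with the reward earned by the preceding decoder's output acting as a baseline. The function below is a hedged sketch of that single step; the reward values are assumed to come from a sentence-level metric such as CIDEr, and all names are hypothetical.

```python
import torch

def refinement_stage_loss(log_prob_sum: torch.Tensor,
                          stage_reward: float,
                          prev_stage_reward: float) -> torch.Tensor:
    """REINFORCE-style loss for one decoder stage. Subtracting the
    preceding stage's reward normalizes the signal, so a stage is only
    rewarded for improving on the caption it was given to refine."""
    advantage = stage_reward - prev_stage_reward
    return -advantage * log_prob_sum   # log_prob_sum: sum of word log-probs

# A full training step would sum this term over all refinement stages.
```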
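Sketch 3: generative cross-modal embedding. The retrieval objective can be read as a standard bidirectional ranking loss on the shared embedding plus two generative terms (caption likelihood for image-to-text, reconstruction error for text-to-image). The hinge ranking loss below is a common formulation, and the combination weights are hypothetical, not values from the thesis.

```python
import torch

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge loss over a batch of matched image-text pairs
    embedded in a common space (rows of img_emb and txt_emb correspond)."""
    scores = img_emb @ txt_emb.t()               # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)              # matched-pair similarities
    cost_txt = (margin + scores - pos).clamp(min=0)       # image vs. wrong texts
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # text vs. wrong images
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt[off_diag].sum() + cost_img[off_diag].sum()

def total_objective(img_emb, txt_emb, i2t_nll, t2i_rec_err, w1=1.0, w2=1.0):
    """Embedding loss plus the two generative losses; w1 and w2 are
    hypothetical trade-off weights."""
    return ranking_loss(img_emb, txt_emb) + w1 * i2t_nll + w2 * t2i_rec_err
```

The generative terms push the embedding to retain locally grounded detail that a pure ranking loss is free to discard.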
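Sketch 4: language pivoting at inference time. Unpaired captioning by pivoting composes two separately trained models, so no image-target pairs are ever needed. The callables below are hypothetical stand-ins for the trained components.

```python
def caption_via_pivot(image, image_to_pivot, pivot_to_target):
    """image_to_pivot is trained on image-pivot caption pairs and
    pivot_to_target on a pivot-target parallel corpus. During training,
    a target-language autoencoder additionally guides the target decoder
    toward caption-like sentences."""
    pivot_caption = image_to_pivot(image)     # caption in the pivot language
    return pivot_to_target(pivot_caption)     # caption in the target language
```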
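Sketch 5: unsupervised scene-graph feature alignment. The abstract states that image-side scene-graph features are mapped into the sentence-side feature space without paired data, but does not specify the mechanism; the adversarial mapper below is one plausible, explicitly hypothetical realization.

```python
import torch.nn as nn

class GraphFeatureMapper(nn.Module):
    """Maps image scene-graph features into the sentence scene-graph
    feature space. Trained adversarially (an assumption) against the
    discriminator below, so no image-sentence pairs are required."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, img_graph_feats):
        return self.net(img_graph_feats)

# Training alternates: the discriminator learns to tell mapped image
# features from real sentence-graph features, and the mapper learns to
# fool it, pulling the two feature distributions together.
discriminator = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
```

Once the features are aligned, the sentence decoder trained on text alone can decode captions directly from mapped image-graph features.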