Incorporating additional knowledge into image captioners
Main Author: | Xu, Yang |
---|---|
Other Authors: | Zhang Hanwang |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering::Computer applications |
Online Access: | https://hdl.handle.net/10356/151726 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-151726 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computer applications |
description |
Image Captioning (IC) is an important visual reasoning task at the intersection of computer vision and natural language processing: it requires a machine to generate a fluent caption that correctly describes a given image. Compared with early template-based and statistics-based captioners, modern encoder-decoder captioners have achieved significant improvements. Such a framework deploys a convolutional neural network (CNN) as the visual encoder to extract visual features and a recurrent neural network (RNN) as the language decoder to generate words from those features. However, because the visual encoder is pre-trained on an object detection task, the learned visual representations focus on object categories while neglecting other important knowledge such as object attributes and relationships. As a result, most existing captioning systems first recognize objects and then infer the remaining words from the recognized objects. Consequently, these systems tend to produce rigid, inflexible descriptions and, even worse, are likely to overfit to dataset bias.
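To make the generic pipeline above concrete, the following is a minimal PyTorch sketch of a CNN-encoder / RNN-decoder captioner. It is purely illustrative; the class name `SimpleCaptioner`, the ResNet-18 backbone, and the vocabulary and feature sizes are assumptions, not taken from the thesis.

```python
# Minimal encoder-decoder captioner sketch (illustrative only, not the
# thesis implementation). A CNN backbone yields a pooled feature vector;
# an LSTM decodes it into a word sequence with teacher forcing.
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)      # visual encoder (CNN)
        backbone.fc = nn.Identity()                   # keep the 512-d pooled features
        self.encoder = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(512, hidden_dim)      # image feature -> initial LSTM state
        self.init_c = nn.Linear(512, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)  # language decoder (RNN)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)                  # (B, 512)
        h, c = self.init_h(feats), self.init_c(feats)
        logits = []
        for t in range(captions.size(1)):             # teacher forcing over caption tokens
            h, c = self.lstm(self.embed(captions[:, t]), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, T, vocab)

model = SimpleCaptioner(vocab_size=1000)
demo = model(torch.randn(2, 3, 224, 224), torch.zeros(2, 5, dtype=torch.long))
print(demo.shape)                                     # torch.Size([2, 5, 1000])
```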
In this thesis, we propose novel captioners that exploit different kinds of additional knowledge to generate more descriptive and less biased captions, thereby alleviating the above limitations. The major contributions are as follows:
A novel \textbf{Shuffle-Then-Assemble (STA)} pre-training strategy is proposed for learning object-agnostic visual features and thus constructing less biased scene graphs. In this strategy, we discard all triplet relationship annotations in an image, leaving two unpaired object domains without object alignments, and then try to recover possible obj1-obj2 pairs to learn the features. A cycle of residual transformations between the two domains is also designed, where the identity mappings encourage the RoI features to capture shared rather than object-specific visual patterns. In this way, object-agnostic visual features are learned and less biased scene graphs can be constructed.
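As an illustration of the cycle of residual transformations, the sketch below maps unpaired RoI features between two object domains and applies a cycle-consistency loss. All names and dimensions are hypothetical; this is not the STA implementation.

```python
# Illustrative cycle of residual transformations between two unpaired
# RoI-feature domains (hypothetical shapes and names).
import torch
import torch.nn as nn

class ResidualMap(nn.Module):
    """Maps features from one object domain to the other as x + f(x)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)     # identity path favors shared, object-agnostic patterns

map_1to2 = ResidualMap()         # domain 1 -> domain 2
map_2to1 = ResidualMap()         # domain 2 -> domain 1

roi_1 = torch.randn(32, 1024)    # unpaired RoI features from object domain 1
roi_2 = torch.randn(32, 1024)    # unpaired RoI features from object domain 2

# Cycle consistency: a feature mapped to the other domain and back should
# return to itself, discouraging object-specific transformations.
cycle_loss = (map_2to1(map_1to2(roi_1)) - roi_1).abs().mean() \
           + (map_1to2(map_2to1(roi_2)) - roi_2).abs().mean()
cycle_loss.backward()
```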
A novel \textbf{Scene Graph Auto-Encoder (SGAE)} is proposed to incorporate language inductive bias as additional knowledge into the encoder-decoder image captioning framework for more human-like captions. We preserve this language inductive bias in a dictionary set learned by reconstructing sentences from their scene graphs. During captioning, we extract a visual scene graph from the given image and retrieve the preserved language inductive bias from the dictionary set to generate more descriptive captions.
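A minimal sketch of the dictionary-retrieval idea, assuming a learnable dictionary queried by a scene-graph embedding through attention; names and sizes are hypothetical and this is not the SGAE code.

```python
# Illustrative retrieval of "language inductive bias" from a learned dictionary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryReencoder(nn.Module):
    def __init__(self, num_entries=1000, dim=512):
        super().__init__()
        # Dictionary assumed to be learned while auto-encoding sentences
        # from their scene graphs.
        self.dictionary = nn.Parameter(torch.randn(num_entries, dim) * 0.02)

    def forward(self, query):                    # query: (B, dim) scene-graph embedding
        scores = query @ self.dictionary.t()     # (B, num_entries)
        attn = F.softmax(scores, dim=-1)
        return attn @ self.dictionary            # re-encoded feature carrying the bias

reenc = DictionaryReencoder()
visual_sg_embedding = torch.randn(4, 512)        # embedding of the image's scene graph
caption_context = reenc(visual_sg_embedding)     # would be fed to the language decoder
print(caption_context.shape)                     # torch.Size([4, 512])
```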
A \textbf{Hierarchical Scene Graph Encoder-Decoder (HSGED)} paragraph captioner is proposed to generate coherent and distinctive paragraphs. This model also exploits the scene graph as additional knowledge, which acts as the ``script'' guiding paragraph generation. In particular, a sentence scene graph RNN (SSG-RNN) generates sub-graph-level topics, which constrain a word scene graph RNN (WSG-RNN) to complete the corresponding sentences. We further propose irredundant attention in the SSG-RNN and inheriting attention in the WSG-RNN, which respectively increase the chance of abstracting topics from rarely described sub-graphs and produce sentences better grounded in the abstracted topics; both lead to more distinctive paragraphs.
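The two-level decoding can be sketched as a sentence-level RNN that proposes one topic per sub-graph and a word-level RNN that expands each topic into a sentence. The snippet below is an illustrative simplification with hypothetical shapes and fixed sentence length, not the HSGED implementation.

```python
# Illustrative two-level decoder: topic-level RNN over sub-graphs,
# word-level RNN per sentence (hypothetical shapes and names).
import torch
import torch.nn as nn

dim, vocab = 512, 1000
ssg_rnn = nn.LSTMCell(dim, dim)            # sentence scene-graph RNN (topic level)
wsg_rnn = nn.LSTMCell(dim, dim)            # word scene-graph RNN (word level)
word_out = nn.Linear(dim, vocab)

subgraph_feats = torch.randn(3, 1, dim)    # one feature per sub-graph "script" step
h_s = c_s = torch.zeros(1, dim)

paragraph_logits = []
for step in range(subgraph_feats.size(0)):               # one sentence per sub-graph
    h_s, c_s = ssg_rnn(subgraph_feats[step], (h_s, c_s))
    topic = h_s                                           # sub-graph-level topic
    h_w, c_w = torch.zeros(1, dim), torch.zeros(1, dim)
    sentence = []
    for _ in range(6):                                    # fixed sentence length for the sketch
        h_w, c_w = wsg_rnn(topic, (h_w, c_w))             # the topic constrains each word
        sentence.append(word_out(h_w))
    paragraph_logits.append(torch.stack(sentence, dim=1))

print(torch.cat(paragraph_logits, dim=1).shape)           # torch.Size([1, 18, 1000])
```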
A novel captioner, \textbf{Learning to Collocate Visual-Linguistic Modules (CVLNM)}, is proposed to imitate how humans compose captions: first structuring a sentence pattern like \textsc{sth do sth at someplace} and then filling in the detailed descriptions. To achieve this, we propose a neural module network for caption generation in which specific modules are designed for generating nouns, adjectives, and verbs. In addition, soft module fusion, multi-step module execution, and a linguistic loss are deployed to make the module controller more faithful to part-of-speech collocations. In this way, our CVLNM not only generates more accurate captions but is also more robust to fewer training samples than concurrent state-of-the-art captioners.
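Soft module fusion can be illustrated as a controller producing a distribution over the noun, adjective, and verb modules and mixing their outputs at each decoding step. The sketch below uses hypothetical names and sizes and is not the CVLNM code.

```python
# Illustrative soft module fusion over part-of-speech-specific modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
modules = nn.ModuleDict({
    "noun": nn.Linear(dim, dim),        # fills object words
    "adjective": nn.Linear(dim, dim),   # fills attribute words
    "verb": nn.Linear(dim, dim),        # fills relationship words
})
controller = nn.Linear(dim, len(modules))

context = torch.randn(2, dim)                        # decoder state at one time step
weights = F.softmax(controller(context), dim=-1)     # soft part-of-speech collocation
outputs = torch.stack([m(context) for m in modules.values()], dim=1)  # (2, 3, dim)
fused = (weights.unsqueeze(-1) * outputs).sum(dim=1)                  # (2, dim)
print(fused.shape)
```

In a sketch like this, a linguistic loss could additionally supervise `weights` with part-of-speech labels so that the controller follows the intended collocations.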
A novel \textbf{Deconfounded Image Captioning (DIC)} model is proposed to generate less biased captions. We design this captioner within the framework of causal inference, which alleviates the negative effects of dataset bias. We first analyze, from the perspective of the confounding effect, why modern image captioners are so easily affected by dataset bias, and then propose DICv1.0, which exploits additional commonsense knowledge as a mediator to transmit information from the image to the caption. More importantly, we retrospect the major captioners of the IC community's past six years from this causal view and show how the retrospect informs the design of DICv1.0. |
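For intuition only, one standard causal-inference recipe for removing confounding is backdoor adjustment, sketched below with a small hypothetical confounder dictionary; this is an illustration of the general idea, not necessarily the exact DICv1.0 formulation.

```python
# Illustrative backdoor adjustment: approximate P(word | do(image)) by
# averaging P(word | image, z) over a confounder dictionary z with prior P(z).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab = 512, 1000
confounders = torch.randn(10, dim)           # hypothetical dataset-level confounder embeddings
prior = torch.full((10,), 1.0 / 10)          # P(z), uniform for the sketch
predictor = nn.Linear(2 * dim, vocab)        # models P(word | image, z)

image_feat = torch.randn(1, dim)
probs = torch.zeros(1, vocab)
for z, p_z in zip(confounders, prior):
    pair = torch.cat([image_feat, z.unsqueeze(0)], dim=-1)
    probs = probs + p_z * F.softmax(predictor(pair), dim=-1)
print(probs.sum())                           # close to 1.0: a deconfounded word distribution
```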
author2 |
Zhang Hanwang |
author_facet |
Zhang Hanwang Xu, Yang |
format |
Thesis-Doctor of Philosophy |
author |
Xu, Yang |
author_sort |
Xu, Yang |
title |
Incorporating additional knowledge into image captioners |
title_short |
Incorporating additional knowledge into image captioners |
title_full |
Incorporating additional knowledge into image captioners |
title_fullStr |
Incorporating additional knowledge into image captioners |
title_full_unstemmed |
Incorporating additional knowledge into image captioners |
title_sort |
incorporating additional knowledge into image captioners |
publisher |
Nanyang Technological University |
publishDate |
2021 |
url |
https://hdl.handle.net/10356/151726 |
_version_ |
1710686941887856640 |
spelling |
sg-ntu-dr.10356-151726 2021-09-06T02:34:41Z. Xu, Yang; Zhang Hanwang (hanwangzhang@ntu.edu.sg), School of Computer Science and Engineering. Doctor of Philosophy, 2021-07-01T05:13:13Z, 2021, Thesis-Doctor of Philosophy. Xu, Y. (2021). Incorporating additional knowledge into image captioners. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/151726. DOI: 10.32657/10356/151726. en. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf. Nanyang Technological University |