Incorporating additional knowledge into image captioners
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/151726 |
Institution: | Nanyang Technological University |
Summary: | Image captioning (IC) is one of the most important visual reasoning tasks. It lies at the intersection of computer vision and natural language processing and requires a machine to generate a fluent caption that correctly describes a given image. Compared with early template-based and statistical captioners, modern captioners built on the encoder-decoder framework have achieved significant improvements. Such a framework deploys a convolutional neural network (CNN) as the visual encoder, which extracts visual features, and a recurrent neural network (RNN) as the language decoder, which generates words from the extracted features. However, because the visual encoder in this pipeline is pre-trained on an object detection task, the learned visual representations focus on objects' categories while neglecting other important knowledge such as objects' attributes and relationships. As a result, most existing captioning systems tend to recognize objects first and then infer the remaining words from the recognized objects. These systems therefore tend to generate rigid and inflexible descriptions or, even worse, to overfit to dataset bias.
In this thesis, we propose several novel captioners that exploit different kinds of additional knowledge to alleviate these limitations and generate more descriptive and less biased captions. Our major contributions are as follows:
A novel Shuffle-Then-Assemble (STA) pre-training strategy is proposed for learning object-agnostic visual features to construct less biased scene graphs. In this strategy, we discard all the triplet relationship annotations in an image, leaving two unpaired object domains without object alignments, and then try to recover possible obj1-obj2 pairs in order to learn features. We also design a cycle of residual transformations between the two domains, in which the identity mappings encourage the RoI features to capture shared rather than object-specific visual patterns. In this way, object-agnostic visual features are learned and less biased scene graphs can be constructed.
A novel Scene Graph Auto-Encoder (SGAE) is proposed to incorporate the language inductive bias, as additional knowledge, into the encoder-decoder image captioning framework for more human-like captions. We store this language inductive bias in a dictionary set by reconstructing sentences from their scene graphs. Then, during image captioning, we extract a visual scene graph from the given image and retrieve the stored language inductive bias from the dictionary set to generate more descriptive captions.
A Hierarchical Scene Graph Encoder-Decoder (HSGED) paragraph captioner is proposed to generate coherent and distinctive paragraphs. This model also exploits the scene graph as additional knowledge, which acts as a "script" to guide paragraph generation. In particular, a sentence scene graph RNN (SSG-RNN) is designed to generate sub-graph-level topics, which constrain the word scene graph RNN (WSG-RNN) as it completes the corresponding sentences. We also propose irredundant attention in the SSG-RNN, which improves the possibility of abstracting topics from rarely described sub-graphs, and inheriting attention in the WSG-RNN, which generates more grounded sentences from the abstracted topics; both give rise to more distinctive paragraphs.
A novel captioner, learning to Collocate Visual-Linguistic Modules (CNM), is proposed to imitate humans, who compose captions by first structuring a sentence pattern such as "sth do sth at someplace" and then filling in the detailed descriptions. To achieve this, we propose a neural module network for generating captions, in which we design specific modules for generating nouns, adjectives, and verbs. In addition, soft module fusion, multi-step module execution, and a linguistic loss are deployed to make the module controller more faithful to part-of-speech collocations. In this way, our CVLNM not only generates more correct captions but is also more robust when fewer training samples are available, compared with concurrent state-of-the-art captioners.
A novel Deconfounded Image Captioning (DIC) model is proposed to generate less biased captions. We design this captioner within the framework of causal inference, which can alleviate the negative effects of dataset bias. We analyze, from the perspective of the confounding effect, why modern image captioners are easily affected by dataset bias, and then propose DICv1.0, which exploits additional commonsense knowledge as a mediator that transmits information from the image to the caption and thereby alleviates these effects. More importantly, we revisit the major captioners of the six-year-old IC community from a causal view and show how this retrospective guided the development of DICv1.0. |
---|
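
The soft module fusion mentioned in the summary for the module-collocating captioner can be illustrated with a minimal sketch. It assumes that fusion is a softmax-weighted sum of the outputs of the noun, adjective, and verb modules, with weights derived from module controller scores; the summary does not spell out this formulation, so the function `soft_module_fusion`, its arguments, and the toy values below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_module_fusion(module_outputs, controller_scores):
    """Fuse per-module feature vectors by a softmax-weighted sum.

    module_outputs: dict mapping a module name to a feature vector (list of floats)
    controller_scores: dict mapping the same names to unnormalized scores
    Both structures are assumptions made for illustration only.
    """
    names = sorted(module_outputs)
    weights = softmax([controller_scores[n] for n in names])
    dim = len(module_outputs[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(module_outputs[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


# Toy example: the controller favors the noun module at this decoding step.
outputs = {"noun": [1.0, 0.0], "adjective": [0.0, 1.0], "verb": [1.0, 1.0]}
scores = {"noun": 2.0, "adjective": 0.0, "verb": 0.0}
fused, weights = soft_module_fusion(outputs, scores)
```

Because the weights are a soft distribution rather than a hard choice of one module, every module contributes to the fused feature, and the whole step stays differentiable under this assumed formulation.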