Aligning vision and language for image captioning using deep learning

Bibliographic Details
Main Author: Cai, Chen
Other Authors: Yap Kim Hui
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/181511
Institution: Nanyang Technological University
Description
Summary: A longstanding objective in multi-modal research uniting computer vision and natural language processing is to develop models that can comprehend the intricate relationship between vision and language. In recent years, we have witnessed notable developments directed towards this objective, enabling computers to interpret visual information and articulate it through captions. Although significant advances have been made, aligning complex visual scenes and language for image captioning tasks using deep learning approaches remains challenging due to the distinct characteristics of the two modalities. In this thesis, we propose four new methods that effectively align vision and language for image captioning-based tasks through deep learning techniques. These methods include integrating visual and semantic attributes for controllable image captioning, aligning visual and object semantics for grounded image captioning, learning the relationship between multi-scale visual features and language for image change captioning, and bridging visual and temporally global caption knowledge for temporal sentence grounding. The major contributions of this thesis are summarized as follows:

A new Attribute Controlled Image Captioning (ACC) method is proposed that seamlessly integrates semantic attributes with visual content, enabling automatic modification of the generated captions in the fashion domain. Our approach uses semantic attributes as a control signal, giving users the ability to specify particular fashion attributes and styles to incorporate while generating captions (an illustrative sketch of this conditioning idea follows this record). Furthermore, we clean, filter, and assemble a new fashion image caption dataset to facilitate learning and to investigate the effectiveness of our method.

A new one-stage Weakly Supervised Grounded Image Captioner (WS-GIC) is proposed that aligns visual and word representations to perform captioning and grounding at the top-down image level. We introduce a Recurrent Grounding Module (RGM) within the decoder to compute Visual Language Attention Maps (VLAMs) for grounding, where the VLAMs indicate the spatial regions of the groundable object words generated in the caption (see the attention-map sketch after this record). In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding; the relation semantics aid the prediction of relation words in the caption.

A new Interactive Change-aware Transformer Network (ICT-Net) is proposed to extract and incorporate the most critical changes of interest in the image, enhancing the generation of change descriptions for complex remote sensing bitemporal scenes. The proposed framework comprises an Interactive Change-aware Encoder (ICE) that captures the crucial differences between bitemporal image features, an Adaptive Fusion Module (AFM) that adaptively aggregates the relevant change-aware features in the encoder layers while minimizing the impact of irrelevant visual features, and a Cross Gated-Attention (CGA) module in the change decoder that enhances the modeling of essential relationships between multi-scale features and word representations (see the gated cross-attention sketch after this record), thereby improving change caption generation.

A new Temporal Sentence Grounding (TSG) method is proposed to bridge the domain gap between multi-modal features by leveraging extensive temporally global caption knowledge sourced from the relevant video and temporally localized text queries. We introduce the Pseudo-query Intermediary Network (PIN) to contrastively align visual features with temporally global textual knowledge, enhancing the similarity between visual and language features (see the contrastive-alignment sketch after this record). Furthermore, we leverage pseudo-query prompts to propagate this knowledge, enhancing the learning of feature alignment within the multi-modal fusion module for better temporal grounding.

Throughout the thesis, we illustrate how each of the proposed methods aligns vision and language for image captioning-based tasks. Experimental results on public datasets indicate that the proposed methods achieve better performance than existing approaches, contributing to enhanced multi-modal vision and language understanding.
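The following is a minimal, hypothetical PyTorch sketch of the general idea behind the ACC contribution: using semantic attributes as a control signal that conditions a caption decoder alongside visual features. It is not the thesis implementation; the module structure, feature dimensions, and fusion-by-concatenation choice are assumptions made purely for illustration.

# Hypothetical sketch: semantic attributes as a control signal for captioning.
# Dimensions, fusion strategy, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class AttributeControlledCaptioner(nn.Module):
    def __init__(self, vocab_size, num_attributes, d_model=512):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d_model)               # project image region features
        self.attr_embed = nn.Embedding(num_attributes, d_model)   # attribute control signal
        self.word_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, attr_ids, caption_ids):
        # visual_feats: (B, R, 2048) region features; attr_ids: (B, A) chosen attribute indices
        # caption_ids: (B, T) caption tokens (teacher forcing during training)
        memory = torch.cat([self.visual_proj(visual_feats),
                            self.attr_embed(attr_ids)], dim=1)    # fuse vision + attribute signal
        tgt = self.word_embed(caption_ids)
        T = caption_ids.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)  # causal mask
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                                   # (B, T, vocab_size) logits

# Changing attr_ids for the same image steers the generated caption toward
# different attributes/styles.
model = AttributeControlledCaptioner(vocab_size=10000, num_attributes=200)
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 200, (2, 5)),
               torch.randint(0, 10000, (2, 12)))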
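Next, a minimal sketch of how a spatial attention map for a generated word can be computed from grid-level visual features, in the spirit of the Visual Language Attention Maps (VLAMs) used for grounding in WS-GIC. The scaled dot-product formulation and the 7x7 grid are assumptions for illustration, not the thesis design.

# Hypothetical sketch: a VLAM-style spatial attention map for one generated word.
import torch
import torch.nn.functional as F

def word_attention_map(word_query, grid_feats, h=7, w=7):
    # word_query: (B, D) decoder hidden state of the word being generated
    # grid_feats: (B, H*W, D) spatial visual features from a backbone
    scores = torch.einsum('bd,bnd->bn', word_query, grid_feats) / grid_feats.size(-1) ** 0.5
    attn = F.softmax(scores, dim=-1)     # attention over spatial locations
    return attn.view(-1, h, w)           # (B, H, W) map; its peak localizes the object word

B, D = 2, 512
vlam = word_attention_map(torch.randn(B, D), torch.randn(B, 49, D))
print(vlam.shape)  # torch.Size([2, 7, 7])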
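The sketch below illustrates a generic gated cross-attention step, loosely analogous to the role described for the Cross Gated-Attention (CGA) module: word representations attend to visual features, and a learned gate suppresses irrelevant visual content. The sigmoid-gating form shown here is an assumption, not the ICT-Net formulation.

# Hypothetical sketch: gated cross-attention between words and visual features.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, words, visual_feats):
        # words: (B, T, D) word representations; visual_feats: (B, N, D) visual features
        attended, _ = self.attn(words, visual_feats, visual_feats)      # cross-attention
        g = torch.sigmoid(self.gate(torch.cat([words, attended], dim=-1)))
        return words + g * attended      # gate down-weights irrelevant visual content

layer = GatedCrossAttention()
out = layer(torch.randn(2, 12, 512), torch.randn(2, 49, 512))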
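Finally, a minimal sketch of contrastively aligning pooled visual features with caption-derived text embeddings, in the spirit of the contrastive alignment performed by the Pseudo-query Intermediary Network (PIN). The symmetric InfoNCE loss and temperature value are standard choices assumed here for illustration; they are not taken from the thesis.

# Hypothetical sketch: InfoNCE-style contrastive alignment of visual and text embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_emb, text_emb, temperature=0.07):
    # visual_emb, text_emb: (B, D) paired embeddings (e.g., a clip and its caption knowledge)
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)    # positives lie on the diagonal
    # symmetric InfoNCE: match each clip to its caption and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))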