Vision language representation learning

Bibliographic Details
Main Author: Yang, Xiaofeng
Other Authors: Lin, Guosheng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/169546
Description
Summary: Vision and language tasks have garnered increasing research attention in recent years and find extensive applications in commercial software such as Microsoft Office, Amazon cloud services, and the iOS Photos app. Vision language representation learning is a crucial component of numerous vision language tasks, including visual question answering (VQA), image captioning, visual reasoning, and visual navigation. Despite its significance, learning effective vision language representations remains challenging for several reasons. First, visual and language information operate at different levels: visual information, such as image pixels, is typically low-level, while language information generally carries high-level semantic meaning, so establishing suitable correspondences between visual and language concepts is difficult. Second, language describes both objects and the relations among them, whereas extracting and representing relations among visual concepts is difficult. Third, vision language models are often trained and evaluated on biased, medium-sized datasets, which limits how general the representations learned solely from such datasets can be. Finally, the predictions of vision language models lack explainability. This thesis presents four approaches that address these challenges.

First, we propose a tiered network that learns both object-level and relation-level vision language correspondence. In contrast to prior methods, the tiered reasoning approach dynamically selects object-level candidates based on the language representation, which facilitates generating robust pairwise relations among the selected objects. The proposed tiered relation reasoning method integrates seamlessly with existing visual reasoning frameworks and yields significant performance gains at minimal computational cost.

Second, training vision language BERTs on large-scale data generally leads to better vision language representations. We therefore introduce a self-training approach that enables training vision language BERTs with unlabeled image data. The approach builds on a unified conditional model, a vision language BERT capable of zero-shot conditional generation: given different conditions, the model generates captions, dense captions, and questions. Using the proposed self-training approach with only 300k additional unlabeled images, we achieve performance competitive with, or better than, similarly sized models trained with three million additional images.

Third, prevailing vision language pretraining models rely heavily on region visual features extracted by object detectors. While they perform well, the extract-then-process pipeline significantly slows inference, limiting real-world applicability. Training vision language models directly from raw image pixels is challenging, however, because raw pixels provide far less prior knowledge than region features. We systematically investigate auxiliary visual pretraining tasks that facilitate training end-to-end vision language models, and we introduce three visual losses that speed up convergence and improve fine-tuning accuracy. Our end-to-end models match or outperform region-feature models on downstream tasks while running more than ten times faster at inference. In addition, the proposed method attains comparable or better performance than other end-to-end models using only ten percent of the pretraining GPU hours.

Finally, the decision-making process of vision language models remains opaque. To address this, we propose an inductive logic programming method that explains the predictions of vision language models in formal logic.
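
To make the tiered reasoning idea concrete, the following PyTorch sketch shows the general pattern the summary describes: object candidates are first scored against the language representation, and pairwise relation features are then built only among the selected top-k objects. The module structure, dimensions, scoring function, and pooling here are illustrative assumptions, not the thesis implementation.

# Minimal sketch of language-guided object selection followed by relation reasoning.
# All names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class TieredRelationReasoning(nn.Module):
    def __init__(self, obj_dim=256, lang_dim=256, k=8):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(obj_dim + lang_dim, 1)        # object relevance w.r.t. the question
        self.relation_mlp = nn.Sequential(                    # pairwise relation encoder
            nn.Linear(2 * obj_dim + lang_dim, obj_dim), nn.ReLU(),
            nn.Linear(obj_dim, obj_dim))

    def forward(self, obj_feats, lang_feat):
        # obj_feats: (N, obj_dim) region features; lang_feat: (lang_dim,) sentence embedding
        n = obj_feats.size(0)
        lang = lang_feat.unsqueeze(0).expand(n, -1)
        scores = self.scorer(torch.cat([obj_feats, lang], dim=-1)).squeeze(-1)
        top_idx = scores.topk(min(self.k, n)).indices          # object-level selection
        sel = obj_feats[top_idx]                                # (k, obj_dim)
        k = sel.size(0)
        pairs = torch.cat([                                     # all ordered pairs among selected objects
            sel.unsqueeze(1).expand(k, k, -1),
            sel.unsqueeze(0).expand(k, k, -1),
            lang_feat.view(1, 1, -1).expand(k, k, -1)], dim=-1)
        relations = self.relation_mlp(pairs)                    # (k, k, obj_dim) relation-level features
        return relations.mean(dim=(0, 1))                       # pooled relation representation

obj_feats = torch.randn(36, 256)   # e.g. 36 detected regions
lang_feat = torch.randn(256)       # encoded question
print(TieredRelationReasoning()(obj_feats, lang_feat).shape)

Because relations are only formed among the k selected objects rather than all N regions, the pairwise step scales with k squared instead of N squared, which is what keeps the relation-level reasoning cheap.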
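
The self-training loop for the second approach can be summarized abstractly as below. The condition tokens, the generate interface, and the quality filter are placeholders standing in for the unified conditional model described in the summary; they are assumptions for illustration only.

# Minimal sketch of one self-training round with a conditional generator.
from dataclasses import dataclass
from typing import Callable, List

CONDITIONS = ["<caption>", "<dense_caption>", "<question>"]  # assumed condition tokens

@dataclass
class PseudoExample:
    image_id: str
    condition: str
    text: str

def self_train_round(unlabeled_images: List[str],
                     generate: Callable[[str, str], str],
                     keep: Callable[[str], bool]) -> List[PseudoExample]:
    """Generate pseudo text for every image under every condition and keep
    only generations that pass a quality filter; the surviving pairs are then
    added to the training set for the next round."""
    pseudo = []
    for image_id in unlabeled_images:
        for cond in CONDITIONS:
            text = generate(image_id, cond)   # zero-shot conditional generation
            if keep(text):
                pseudo.append(PseudoExample(image_id, cond, text))
    return pseudo

# Toy stand-ins so the sketch runs; a real setup would call the vision language BERT generator.
fake_generate = lambda img, cond: f"{cond} text for {img}"
keep_all = lambda text: len(text) > 0
print(self_train_round(["img_001", "img_002"], fake_generate, keep_all)[:2])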
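
The training setup of the third approach, a main vision language objective combined with auxiliary visual losses computed on raw pixels, can be sketched as follows. The two auxiliary terms shown (masked-patch reconstruction and a global image-text contrastive loss) are generic examples of visual pretraining losses and are not claimed to be the three losses proposed in the thesis; the weighting and shapes are likewise assumptions.

# Minimal sketch: main VL loss plus auxiliary visual losses on raw-pixel inputs.
import torch
import torch.nn.functional as F

def training_step(pixel_patches, patch_embeds, text_embed,
                  vl_logits, vl_targets, mask, decoder, aux_weight=0.1):
    # Main vision language objective (e.g. answer classification).
    main_loss = F.cross_entropy(vl_logits, vl_targets)

    # Auxiliary visual loss 1: regress raw pixel patches from masked patch embeddings.
    recon_loss = F.mse_loss(decoder(patch_embeds[mask]), pixel_patches[mask])

    # Auxiliary visual loss 2: align the pooled visual embedding with the text embedding.
    img_global = F.normalize(patch_embeds.mean(dim=1), dim=-1)
    txt_global = F.normalize(text_embed, dim=-1)
    sim = img_global @ txt_global.t() / 0.07      # (B, B) scaled similarity matrix
    targets = torch.arange(sim.size(0))
    contrastive_loss = F.cross_entropy(sim, targets)

    return main_loss + aux_weight * (recon_loss + contrastive_loss)

# Toy shapes so the sketch runs end to end.
B, P, D, patch_dim, n_answers = 4, 16, 256, 48, 10
mask = torch.zeros(B, P, dtype=torch.bool)
mask[:, :P // 2] = True                            # pretend half the patches are masked
loss = training_step(torch.randn(B, P, patch_dim), torch.randn(B, P, D),
                     torch.randn(B, D), torch.randn(B, n_answers),
                     torch.randint(0, n_answers, (B,)),
                     mask, torch.nn.Linear(D, patch_dim))
print(loss.item())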