Vision language representation learning

Vision and language tasks have garnered increasing research attention in recent years and find extensive applications in commercial software such as Microsoft Office, Amazon cloud services, and the iOS Photos app. Vision language representation learning is a crucial component of numerous vision language tasks, including visual question answering (VQA), image captioning, visual reasoning, and visual navigation.

Bibliographic Details
Main Author: Yang, Xiaofeng
Other Authors: Lin Guosheng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/169546
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-169546
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description Vision and language tasks have garnered increasing research attention in recent years and find extensive applications in commercial software such as Microsoft Office, Amazon cloud services, and the iOS Photos app. Vision language representation learning is a crucial component of numerous vision language tasks, including visual question answering (VQA), image captioning, visual reasoning, and visual navigation. Despite its significance, learning effective vision language representations remains challenging for several reasons. Firstly, visual and language information operate at different levels: visual information such as image pixels is typically low-level, while language generally carries high-level semantic meaning, so establishing suitable correspondences between visual and language concepts is difficult. Secondly, language information encompasses both objects and object relations, whereas extracting and representing relations among visual concepts is difficult. Thirdly, vision and language models are often trained and evaluated on biased, medium-sized datasets, which limits how general a representation can be learned from such datasets alone. Finally, the predictions of vision language models lack explainability.

This thesis presents four approaches that address these challenges. Firstly, we propose a tiered network that learns both object-level and relation-level vision language correspondence. In contrast to prior methods, the tiered reasoning approach dynamically selects object-level candidates based on language representations and then builds robust pairwise relations among the selected objects. The proposed tiered relation reasoning method integrates seamlessly with existing visual reasoning frameworks and yields significant performance gains at minimal computational cost.

Secondly, training vision language BERTs on larger datasets typically yields better representations, so we introduce a self-training approach that allows vision language BERTs to be trained on unlabeled image data. The approach builds on a unified conditional model, a vision language BERT capable of zero-shot conditional generation: given different conditions, it generates captions, dense captions, and questions. Using the proposed self-training approach with only 300k additional unlabeled images, we achieve performance that is competitive with, or better than, similarly sized models trained with three million additional images.

Thirdly, prevailing vision language pretraining models rely heavily on region visual features extracted by object detectors. While these models perform well, the extract-then-process pipeline severely limits inference speed and hence real-world applicability. Training vision language models directly from raw image pixels is challenging, however, because raw pixels provide far less prior knowledge than region features. We systematically investigate how auxiliary visual pretraining tasks can facilitate training end-to-end vision language models, and we introduce three visual losses that speed up convergence and improve fine-tuning accuracy. Our end-to-end models match or outperform region-feature models on downstream tasks while running more than ten times faster at inference, and they reach comparable or better performance than other end-to-end models with only ten percent of the pretraining GPU hours.

Lastly, the decision-making process of vision language models remains opaque. To address this, we propose an inductive logic programming method that explains the predictions of vision language models in a formal logic language.
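The tiered reasoning approach is only described at a high level in this record. Purely as an illustrative sketch, and not the thesis implementation, the PyTorch snippet below shows one way language-guided candidate selection followed by pairwise relation features could be wired up; the module names, feature dimensions, dot-product scoring rule, and top-k selection are all assumptions introduced here.

```python
# Hypothetical sketch of "tiered" relation reasoning:
# tier 1 selects object candidates using the language representation,
# tier 2 builds pairwise relation features among only the selected objects.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TieredRelationSketch(nn.Module):
    def __init__(self, obj_dim=2048, lang_dim=768, hidden=512, k=8):
        super().__init__()
        self.k = k
        self.obj_proj = nn.Linear(obj_dim, hidden)    # project detector region features
        self.lang_proj = nn.Linear(lang_dim, hidden)  # project sentence-level language embedding
        # relation MLP applied to concatenated (object_i, object_j) pairs
        self.rel_mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, obj_feats, lang_feat):
        # obj_feats: (B, N, obj_dim) region features; lang_feat: (B, lang_dim)
        obj = self.obj_proj(obj_feats)                  # (B, N, H)
        lang = self.lang_proj(lang_feat).unsqueeze(1)   # (B, 1, H)
        # Tier 1: score every object against the language representation, keep top-k
        scores = (obj * lang).sum(-1)                   # (B, N)
        topk = scores.topk(self.k, dim=1).indices       # (B, k)
        sel = torch.gather(obj, 1, topk.unsqueeze(-1).expand(-1, -1, obj.size(-1)))
        # Tier 2: pairwise relation features among the selected objects only
        a = sel.unsqueeze(2).expand(-1, -1, self.k, -1)  # (B, k, k, H)
        b = sel.unsqueeze(1).expand(-1, self.k, -1, -1)  # (B, k, k, H)
        rel = self.rel_mlp(torch.cat([a, b], dim=-1))    # (B, k, k, H)
        return sel, rel, F.softmax(scores, dim=-1)

# Example shapes (hypothetical): 36 detected regions, BERT-style sentence embedding.
# model = TieredRelationSketch()
# sel, rel, attn = model(torch.randn(2, 36, 2048), torch.randn(2, 768))
```

Restricting the pairwise computation to the k language-selected objects rather than all N detected regions keeps the added cost at roughly k^2 instead of N^2 pairs, which is consistent with the "minimal computational cost" claim in the abstract above.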
author2 Lin Guosheng
format Thesis-Doctor of Philosophy
author Yang, Xiaofeng
title Vision language representation learning
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/169546
_version_ 1773551390070669312
spelling sg-ntu-dr.10356-169546 2023-08-01T07:08:34Z Vision language representation learning Yang, Xiaofeng Lin Guosheng School of Computer Science and Engineering gslin@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy 2023-07-24T06:19:32Z 2023-07-24T06:19:32Z 2023 Thesis-Doctor of Philosophy Yang, X. (2023). Vision language representation learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169546 10.32657/10356/169546 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University