Vision language representation learning
Vision and language tasks have garnered increasing research attention in recent years, finding extensive applications in commercial software such as Microsoft Office, Amazon cloud services, and the iOS photo app. Vision language representation learning is a crucial component of numerous vision langu...
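Main Author: | Yang, Xiaofeng |
---|---|
Other Authors: | Lin Guosheng |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University 2023 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/169546 |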
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-169546 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
description |
Vision and language tasks have garnered increasing research attention in recent years and are widely used in commercial software such as Microsoft Office, Amazon cloud services, and the iOS photo app. Vision language representation learning is a crucial component of numerous vision language tasks, including visual question answering (VQA), image captioning, visual reasoning, and visual navigation. Despite its significance, learning effective vision language representations remains challenging for several reasons. Firstly, visual information and language information operate at different levels. Visual information, such as image pixels, is typically low-level, while language information is generally high-level and carries semantic meaning. Establishing suitable correspondences between visual and language concepts is therefore difficult. Secondly, language information encompasses both objects and object relations, whereas relations among visual concepts are difficult to extract and represent. Thirdly, vision and language models are often trained and evaluated on biased medium-sized datasets, which limits the ability to learn general representations from such datasets alone. Finally, the predictions of vision language models lack explainability.
This thesis encompasses four approaches that address the aforementioned challenges.
Firstly, we propose a tiered network to effectively learn both object-level vision language correspondence and relation-level vision language correspondence. In contrast to prior methods, the tiered reasoning approach dynamically selects object-level candidates based on language representations, facilitating the generation of robust pairwise relations among the selected objects. The proposed tiered relation reasoning method seamlessly integrates with existing visual reasoning frameworks, resulting in significant performance enhancements at minimal computational cost.
Secondly, training vision language BERTs with large-scale data often leads to improved vision language representations. To this end, we introduce a self-training approach that enables training vision language BERTs using unlabeled image data. The approach leverages a unified conditional model, a vision language BERT model capable of performing zero-shot conditional generation. Depending on the condition, the unified conditional model generates captions, dense captions, or questions. With the proposed self-training approach and just 300k extra unlabeled images, we achieve competitive or even superior performance compared to models of similar size trained with three million extra images.
Thirdly, prevailing vision language pretraining models rely heavily on region visual features extracted from object detectors. While they exhibit excellent performance, the extract-then-process pipeline significantly hampers inference speed, limiting their real-world applicability. However, training vision language models directly from raw image pixels presents challenges, as raw image pixels provide significantly less prior knowledge than region features. In this study, we systematically investigate the use of auxiliary visual pretraining tasks to facilitate training end-to-end vision language models. We introduce three visual losses that speed up convergence and improve fine-tuning accuracy. Our end-to-end models match or outperform region feature models on downstream tasks while running inference more than ten times faster. Additionally, our proposed method attains comparable or superior performance to other end-to-end models with only ten percent of the pretraining GPU hours.
Lastly, the decision-making process of vision language models remains opaque. To address this, we propose an inductive logic programming method that enables the explanation of vision language models in formal logic language. |
author2 |
Lin Guosheng |
author_facet |
Lin Guosheng Yang, Xiaofeng |
format |
Thesis-Doctor of Philosophy |
author |
Yang, Xiaofeng |
author_sort |
Yang, Xiaofeng |
title |
Vision language representation learning |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/169546 |
_version_ |
1773551390070669312 |
spelling |
sg-ntu-dr.10356-169546 2023-08-01T07:08:34Z Vision language representation learning Yang, Xiaofeng Lin Guosheng School of Computer Science and Engineering gslin@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy 2023-07-24T06:19:32Z 2023-07-24T06:19:32Z 2023 Thesis-Doctor of Philosophy Yang, X. (2023). Vision language representation learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169546 https://hdl.handle.net/10356/169546 10.32657/10356/169546 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |