Vision language representation learning

Vision and language tasks have garnered increasing research attention in recent years and find extensive applications in commercial software such as Microsoft Office, Amazon cloud services, and the iOS Photos app. Vision language representation learning is a crucial component of numerous vision language tasks, including visual question answering (VQA), image captioning, visual reasoning, and visual navigation.

Bibliographic Details
Main Author: Yang, Xiaofeng
Other Authors: Lin Guosheng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/169546
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-169546
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description Vision and language tasks have garnered increasing research attention in recent years and find extensive applications in commercial software such as Microsoft Office, Amazon cloud services, and the iOS Photos app. Vision language representation learning is a crucial component of numerous vision language tasks, including visual question answering (VQA), image captioning, visual reasoning, and visual navigation. Despite its significance, learning effective vision language representations remains challenging for several reasons. Firstly, visual and language information operate at different levels: visual information such as image pixels is typically low-level, while language generally carries high-level semantic meaning, so establishing suitable correspondences between visual and language concepts is difficult. Secondly, language information encompasses both objects and object relations, whereas extracting and representing relations among visual concepts is difficult. Thirdly, vision and language models are often trained and evaluated on biased, medium-sized datasets, which limits how general a representation can be learned from such datasets alone. Finally, the predictions of vision language models lack explainability.

This thesis presents four approaches that address these challenges. Firstly, we propose a tiered network that learns both object-level and relation-level vision language correspondence. In contrast to prior methods, the tiered reasoning approach dynamically selects object-level candidates based on language representations and then builds robust pairwise relations among the selected objects. The proposed tiered relation reasoning method integrates seamlessly with existing visual reasoning frameworks and yields significant performance gains at minimal computational cost.

Secondly, training vision language BERTs on larger datasets typically yields better representations, so we introduce a self-training approach that allows vision language BERTs to be trained on unlabeled image data. The approach builds on a unified conditional model, a vision language BERT capable of zero-shot conditional generation: given different conditions, it generates captions, dense captions, and questions. Using the proposed self-training approach with only 300k additional unlabeled images, we achieve performance that is competitive with, or better than, similarly sized models trained with three million additional images.

Thirdly, prevailing vision language pretraining models rely heavily on region visual features extracted by object detectors. While these models perform well, the extract-then-process pipeline severely limits inference speed and hence real-world applicability. Training vision language models directly from raw image pixels is challenging, however, because raw pixels provide far less prior knowledge than region features. We systematically investigate how auxiliary visual pretraining tasks can facilitate training end-to-end vision language models, and we introduce three visual losses that speed up convergence and improve fine-tuning accuracy. Our end-to-end models match or outperform region-feature models on downstream tasks while running more than ten times faster at inference, and they reach comparable or better performance than other end-to-end models with only ten percent of the pretraining GPU hours.

Lastly, the decision-making process of vision language models remains opaque. To address this, we propose an inductive logic programming method that explains the predictions of vision language models in a formal logic language.
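The tiered reasoning approach is only described at a high level in this record. Purely as an illustrative sketch, and not the thesis implementation, the PyTorch snippet below shows one way language-guided candidate selection followed by pairwise relation features could be wired up; the module names, feature dimensions, dot-product scoring rule, and top-k selection are all assumptions introduced here.

```python
# Hypothetical sketch of "tiered" relation reasoning:
# tier 1 selects object candidates using the language representation,
# tier 2 builds pairwise relation features among only the selected objects.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TieredRelationSketch(nn.Module):
    def __init__(self, obj_dim=2048, lang_dim=768, hidden=512, k=8):
        super().__init__()
        self.k = k
        self.obj_proj = nn.Linear(obj_dim, hidden)    # project detector region features
        self.lang_proj = nn.Linear(lang_dim, hidden)  # project sentence-level language embedding
        # relation MLP applied to concatenated (object_i, object_j) pairs
        self.rel_mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, obj_feats, lang_feat):
        # obj_feats: (B, N, obj_dim) region features; lang_feat: (B, lang_dim)
        obj = self.obj_proj(obj_feats)                  # (B, N, H)
        lang = self.lang_proj(lang_feat).unsqueeze(1)   # (B, 1, H)
        # Tier 1: score every object against the language representation, keep top-k
        scores = (obj * lang).sum(-1)                   # (B, N)
        topk = scores.topk(self.k, dim=1).indices       # (B, k)
        sel = torch.gather(obj, 1, topk.unsqueeze(-1).expand(-1, -1, obj.size(-1)))
        # Tier 2: pairwise relation features among the selected objects only
        a = sel.unsqueeze(2).expand(-1, -1, self.k, -1)  # (B, k, k, H)
        b = sel.unsqueeze(1).expand(-1, self.k, -1, -1)  # (B, k, k, H)
        rel = self.rel_mlp(torch.cat([a, b], dim=-1))    # (B, k, k, H)
        return sel, rel, F.softmax(scores, dim=-1)

# Example shapes (hypothetical): 36 detected regions, BERT-style sentence embedding.
# model = TieredRelationSketch()
# sel, rel, attn = model(torch.randn(2, 36, 2048), torch.randn(2, 768))
```

Restricting the pairwise computation to the k language-selected objects rather than all N detected regions keeps the added cost at roughly k^2 instead of N^2 pairs, which is consistent with the "minimal computational cost" claim in the abstract above.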
author2 Lin Guosheng
format Thesis-Doctor of Philosophy
author Yang, Xiaofeng
title Vision language representation learning
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/169546
_version_ 1773551390070669312
spelling sg-ntu-dr.10356-169546 2023-08-01T07:08:34Z Vision language representation learning Yang, Xiaofeng Lin Guosheng School of Computer Science and Engineering gslin@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy 2023-07-24T06:19:32Z 2023-07-24T06:19:32Z 2023 Thesis-Doctor of Philosophy Yang, X. (2023). Vision language representation learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169546 10.32657/10356/169546 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University