Using pre-trained models for vision-language understanding tasks

Bibliographic Details
Main Author: CAO, Rui
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Vision-language understanding; Visual question answering; Hateful meme detection; Pre-trained models; Computer Sciences; Programming Languages and Compilers
Online Access: https://ink.library.smu.edu.sg/etd_coll/595
https://ink.library.smu.edu.sg/context/etd_coll/article/1593/viewcontent/Rui_Thesis_PTMs_VLU.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.etd_coll-1593
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Vision-language understanding
Visual question answering
Hateful meme detection
Pre-trained models
Computer Sciences
Programming Languages and Compilers
spellingShingle Vision-language understanding
Visual question answering
Hateful meme detection
Pre-trained models
Computer Sciences
Programming Languages and Compilers
CAO, Rui
Using pre-trained models for vision-language understanding tasks
description In recent years, remarkable progress has been made in Artificial Intelligence (AI), with an increasing focus on integrating AI systems into people’s daily lives. In the context of our diverse world, research attention has shifted towards applying AI to multimodal understanding tasks. This thesis specifically addresses two key modalities, vision and language, and explores Vision-Language Understanding (VLU). In the past, addressing VLU tasks involved training distinct models from scratch using task-specific data. However, with only limited training data, such models easily overfit and fail to generalize. A recent breakthrough is the development of Pre-trained Models (PTMs), which are trained on extensive datasets to acquire universal representations, and leveraging PTMs for VLU tasks has become a prevalent approach. The use of PTMs for VLU tasks falls into two paradigms: (1) fine-tuning PTMs with downstream task data, and (2) zero-shot transfer or few-shot learning based on frozen PTMs. However, existing methods under these two paradigms suffer from a few limitations: direct fine-tuning of PTMs may overlook the unique characteristics of the downstream tasks; the zero-shot and few-shot performance of PTMs on some tasks may be poor; and complex VLU tasks may require multiple reasoning skills that a single PTM does not possess.

In this thesis, we aim to address these limitations by optimizing the use of PTMs for VLU tasks. Our work is organized along two axes: whether we rely on fine-tuning or on zero-shot/few-shot learning, and whether we adopt a single PTM or a composition of PTMs. When tuning a single PTM, we explore how to incorporate task-specific components to better cater to downstream tasks (Tuning-Single). For VLU tasks where frozen PTMs perform poorly on their own, we investigate using a single frozen PTM to facilitate sub-steps of these tasks (Frozen-Single). We also study how to compose a set of tuned PTMs, each capable of a particular reasoning skill, to improve performance in the low-resource setting (Tuning-Composition). Finally, as VLU tasks may involve multiple skills and multiple reasoning steps, we consider a composition of frozen PTMs and assign reasoning sub-tasks to appropriate frozen PTMs without requiring any adaptation (Frozen-Composition). Specifically, we narrow our scope to two VLU tasks: Hateful Meme Detection (HMD) and Visual Question Answering (VQA). HMD classifies a given multimodal meme as either hateful or not hateful, while VQA aims to answer questions about a given image. The decision to focus on these two tasks stems from their importance in real-world applications; furthermore, both tasks present non-trivial challenges that demand innovative solutions.

For HMD, most existing work has focused on direct fine-tuning of PTMs, treating HMD as a general multimodal classification task and overlooking its unique characteristics. We address this limitation by integrating task-specific components with PTMs and tuning them end-to-end. We propose DisMultiHate, which is based on a PTM but learns to disentangle representations of hate-speech-related target entities in memes to enhance hateful content classification. Additionally, HMD often requires external background knowledge for meme comprehension, yet there are no dedicated knowledge bases constructed for this purpose. In light of this, we explore leveraging the knowledge in Pre-trained Language Models (PT-LMs). We propose PromptHate, which prompts PT-LMs and utilizes their implicit knowledge for HMD. Since PT-LMs are inherently textual, PromptHate converts images into textual captions with a frozen pre-trained vision-language model (PT-VLM). Though it achieves good detection performance, PromptHate suffers from non-informative captions: generic image descriptions may lack crucial details, such as race and gender information, that are vital for detecting hateful content. To address this, we propose Pro-Cap, which leverages a frozen PT-VLM to complement PromptHate. Specifically, we prompt a frozen PT-VLM with hateful-content-related questions and use the answers as image captions (termed Pro-Cap), ensuring that the captions contain information critical for hateful content detection. While these methods exhibit commendable performance, they rely heavily on extensive supervised learning, which demands large volumes of annotated data and is both costly and time-consuming. In response, we further introduce Mod-HATE, which harnesses a composition of tuned PTMs, each of which possesses an essential reasoning capability for HMD. To the best of our knowledge, Mod-HATE represents a pioneering exploration of hateful meme detection tailored to the few-shot learning setting.

We study VQA under the zero-shot transfer setting. Previous zero-shot VQA models overlook the multi-step reasoning chains inherent in VQA. To address this oversight, we introduce a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps, converts the sub-reasoning tasks into objectives suitable for PTMs, and assigns these tasks to appropriate PTMs without adaptation. Expanding our investigation, we delve into a specific VQA scenario known as knowledge-based VQA (K-VQA), where external knowledge beyond the image is indispensable for answering the given questions. Recent approaches have utilized pre-trained large language models (LLMs) as both a knowledge source and a zero-shot QA model for K-VQA. However, these methods do not explicitly show the knowledge needed to answer questions and thus lack interpretability. To rectify this deficiency, we propose KGENVQA, which first generates knowledge from a frozen LLM and subsequently leverages another frozen LLM for question answering with the incorporation of the generated knowledge. Finally, we conclude the thesis with a summary of our contributions and a discussion of potential future directions regarding the application of PTMs to VLU.
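
To make the prompting pipeline described above concrete, the following is a minimal illustrative sketch, not the thesis's actual implementation, of the Pro-Cap idea feeding a PromptHate-style classifier: a frozen PT-VLM is asked hateful-content-related probing questions about the meme image, the answers are joined into an enriched caption, and a language model is then prompted with the caption and the meme text to decide hateful versus not hateful. The probing questions and the helpers `query_vlm` and `query_lm` are hypothetical stand-ins for whichever frozen models are used.

```python
from typing import Callable, List

# Hypothetical frozen-model interfaces (assumptions for illustration, not the
# thesis code): query_vlm answers a question about an image, query_lm continues
# a text prompt. Any frozen PT-VLM / PT-LM with a text-in, text-out interface
# could back these callables.
QueryVLM = Callable[[str, str], str]   # (image_path, question) -> answer
QueryLM = Callable[[str], str]         # (prompt) -> continuation

# Probing questions in the spirit of Pro-Cap: each targets information that a
# generic caption often omits but that matters for hateful content detection.
PROBING_QUESTIONS: List[str] = [
    "What is shown in the image?",
    "What is the race of the person in the image?",
    "What is the gender of the person in the image?",
    "What religion is referenced in the image, if any?",
]


def pro_cap_caption(image_path: str, query_vlm: QueryVLM) -> str:
    """Build an enriched caption by asking a frozen VLM the probing questions."""
    answers = [query_vlm(image_path, q) for q in PROBING_QUESTIONS]
    return " ".join(a.strip() for a in answers if a.strip())


def prompt_based_detection(meme_text: str, caption: str, query_lm: QueryLM) -> str:
    """PromptHate-style step: frame HMD as a prompt to a language model."""
    prompt = (
        f"Meme text: {meme_text}\n"
        f"Image description: {caption}\n"
        "Question: Is this meme hateful? Answer yes or no.\n"
        "Answer:"
    )
    answer = query_lm(prompt).strip().lower()
    return "hateful" if answer.startswith("yes") else "not hateful"
```

Per the abstract, PromptHate and Pro-Cap still rely on supervised training with annotated memes; the sketch keeps every component frozen only to show the data flow from probing questions to captions to the final prompt.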
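The two-stage K-VQA pipeline attributed to KGENVQA above (generate knowledge with one frozen LLM, then answer with a second frozen LLM) can be sketched in the same spirit. The wrappers `describe_image`, `knowledge_llm`, and `qa_llm` are hypothetical, as is the choice to surface image content to the text-only LLMs via a caption; the actual KGENVQA prompts and model choices may differ.

```python
from typing import Callable, Tuple

# Hypothetical wrappers around frozen models (assumptions for illustration):
#   describe_image: a frozen PT-VLM that turns an image into a short caption
#   knowledge_llm:  a frozen LLM prompted to state relevant background facts
#   qa_llm:         a second frozen LLM that answers given question + knowledge
DescribeImage = Callable[[str], str]
TextLLM = Callable[[str], str]


def kgenvqa_style_answer(
    image_path: str,
    question: str,
    describe_image: DescribeImage,
    knowledge_llm: TextLLM,
    qa_llm: TextLLM,
) -> Tuple[str, str]:
    """Answer a knowledge-based visual question in two stages, returning the
    generated knowledge alongside the answer."""
    context = describe_image(image_path)

    # Stage 1: elicit background knowledge relevant to this image and question.
    knowledge = knowledge_llm(
        f"Image: {context}\n"
        f"Question: {question}\n"
        "List the background facts needed to answer the question:"
    )

    # Stage 2: answer the question with the generated knowledge in the prompt.
    answer = qa_llm(
        f"Image: {context}\n"
        f"Knowledge: {knowledge}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return answer.strip(), knowledge.strip()
```

Returning the generated knowledge together with the answer mirrors the interpretability motivation stated in the abstract: the evidence used for each prediction is made explicit rather than left implicit in the LLM.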
format text
author CAO, Rui
author_facet CAO, Rui
author_sort CAO, Rui
title Using pre-trained models for vision-language understanding tasks
title_short Using pre-trained models for vision-language understanding tasks
title_full Using pre-trained models for vision-language understanding tasks
title_fullStr Using pre-trained models for vision-language understanding tasks
title_full_unstemmed Using pre-trained models for vision-language understanding tasks
title_sort using pre-trained models for vision-language understanding tasks
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/etd_coll/595
https://ink.library.smu.edu.sg/context/etd_coll/article/1593/viewcontent/Rui_Thesis_PTMs_VLU.pdf
_version_ 1814047621784797184
spelling sg-smu-ink.etd_coll-15932024-06-19T03:30:27Z Using pre-trained models for vision-language understanding tasks CAO, Rui 2024-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/etd_coll/595 https://ink.library.smu.edu.sg/context/etd_coll/article/1593/viewcontent/Rui_Thesis_PTMs_VLU.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Dissertations and Theses Collection (Open Access) eng Institutional Knowledge at Singapore Management University Vision-language understanding Visual question answering Hateful meme detection Pre-trained models Computer Sciences Programming Languages and Compilers