Modularized zero-shot VQA with pre-trained models

Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capabi...

Bibliographic Details
Main Authors: CAO, Rui, JIANG, Jing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
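Subjects: Computational linguistics; Zero-shot learning; Object detection; Artificial Intelligence and Robotics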
Online Access: https://ink.library.smu.edu.sg/sis_research/8307
https://ink.library.smu.edu.sg/context/sis_research/article/9310/viewcontent/ACL_Findings_Camera_Ready.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9310
record_format dspace
spelling sg-smu-ink.sis_research-9310 2023-12-05T03:18:51Z Modularized zero-shot VQA with pre-trained models CAO, Rui JIANG, Jing Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines. 2023-07-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8307 info:doi/10.18653/v1/2023.findings-acl.5 https://ink.library.smu.edu.sg/context/sis_research/article/9310/viewcontent/ACL_Findings_Camera_Ready.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Computational linguistics Zero-shot learning Object detection Artificial Intelligence and Robotics
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Computational linguistics
Zero-shot learning
Object detection
Artificial Intelligence and Robotics
description Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.
format text
author CAO, Rui
JIANG, Jing
title Modularized zero-shot VQA with pre-trained models
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8307
https://ink.library.smu.edu.sg/context/sis_research/article/9310/viewcontent/ACL_Findings_Camera_Ready.pdf