ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense

Humans possess a strong capability for reasoning beyond common sense. For example, given an unconventional image of a goldfish lying on the table next to an empty fishbowl, a human would effortlessly determine that the fish is not inside the fishbowl. The case, however, may be different for a visio...

Bibliographic Details
Main Authors: ZHOU, Kankan, LAI, Eason, YEONG, Au Wei Bin, MOURATIDIS, Kyriakos, JIANG, Jing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8352
https://ink.library.smu.edu.sg/context/sis_research/article/9355/viewcontent/2023.findings_emnlp.683.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9355
record_format dspace
spelling sg-smu-ink.sis_research-9355
2023-12-19T03:36:21Z
ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense
ZHOU, Kankan
LAI, Eason
YEONG, Au Wei Bin
MOURATIDIS, Kyriakos
JIANG, Jing
2023-12-01T08:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/8352
info:doi/10.48550/arXiv.2310.19301
https://ink.library.smu.edu.sg/context/sis_research/article/9355/viewcontent/2023.findings_emnlp.683.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Artificial Intelligence and Robotics
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Artificial Intelligence and Robotics
description Humans possess a strong capability for reasoning beyond common sense. For example, given an unconventional image of a goldfish lying on the table next to an empty fishbowl, a human would effortlessly determine that the fish is not inside the fishbowl. The case, however, may be different for a vision-language model, whose reasoning could gravitate towards the common scenario that the fish is inside the bowl, despite the visual input. In this paper, we introduce a novel probing dataset named ROME (reasoning beyond commonsense knowledge) to evaluate whether state-of-the-art pre-trained vision-language models have the reasoning capability to correctly interpret counter-intuitive content. ROME contains images that defy commonsense knowledge with regard to color, shape, material, size and positional relation. Experiments on state-of-the-art pre-trained vision-language models reveal that most of these models are still largely incapable of interpreting counter-intuitive scenarios. We hope that ROME will spur further investigations into reasoning beyond commonsense knowledge in vision-language research.
format text
author ZHOU, Kankan
LAI, Eason
YEONG, Au Wei Bin
MOURATIDIS, Kyriakos
JIANG, Jing
title ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
_version_ 1787136839604240384