ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense

Humans possess a strong capability for reasoning beyond common sense. For example, given an unconventional image of a goldfish laying on the table next to an empty fishbowl, a human would effortlessly determine that the fish is not inside the fishbowl. The case, however, may be different for a visio...

Bibliographic Details
Main Authors:	ZHOU, Kankan, LAI, Eason, YEONG, Au Wei Bin, MOURATIDIS, Kyriakos, JIANG, Jing
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	Artificial Intelligence and Robotics
Online Access:	https://ink.library.smu.edu.sg/sis_research/8352 https://ink.library.smu.edu.sg/context/sis_research/article/9355/viewcontent/2023.findings_emnlp.683.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9355
record_format	dspace
spelling	sg-smu-ink.sis_research-9355 2023-12-19T03:36:21Z ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense ZHOU, Kankan LAI, Eason YEONG, Au Wei Bin MOURATIDIS, Kyriakos JIANG, Jing 2023-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8352 info:doi/10.48550/arXiv.2310.19301 https://ink.library.smu.edu.sg/context/sis_research/article/9355/viewcontent/2023.findings_emnlp.683.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Artificial Intelligence and Robotics
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Artificial Intelligence and Robotics
description	Humans possess a strong capability for reasoning beyond common sense. For example, given an unconventional image of a goldfish laying on the table next to an empty fishbowl, a human would effortlessly determine that the fish is not inside the fishbowl. The case, however, may be different for a vision-language model, whose reasoning could gravitate towards the common scenario that the fish is inside the bowl, despite the visual input. In this paper, we introduce a novel probing dataset named ROME (reasoning beyond commonsense knowledge) to evaluate whether the state-of-the-art pre-trained vision-language models have the reasoning capability to correctly interpret counter-intuitive content. ROME contains images that defy commonsense knowledge with regards to color, shape, material, size and positional relation. Experiments on the state-of-the-art pre-trained vision-language models reveal that most of these models are still largely incapable of interpreting counter-intuitive scenarios. We hope that ROME will spur further investigations on reasoning beyond commonsense knowledge in vision-language research.
format	text
author	ZHOU, Kankan LAI, Eason YEONG, Au Wei Bin MOURATIDIS, Kyriakos JIANG, Jing
author_sort	ZHOU, Kankan
title	ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense
publisher	Institutional Knowledge at Singapore Management University
publishDate	2023
url	https://ink.library.smu.edu.sg/sis_research/8352 https://ink.library.smu.edu.sg/context/sis_research/article/9355/viewcontent/2023.findings_emnlp.683.pdf
_version_	1787136839604240384
