Q-bench: a benchmark for general-purpose foundation models on low-level vision
The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding…
Saved in:
Main Authors: Wu, Haoning; Zhang, Zicheng; Zhang, Erli; Chen, Chaofeng; Liao, Liang; Wang, Annan; Li, Chunyi; Sun, Wenxiu; Yan, Qiong; Zhai, Guangtao; Lin, Weisi
Other Authors: College of Computing and Data Science
Format: Conference or Workshop Item
Language: English
Published: 2024
Subjects: Computer and Information Science; Multi-modality large language models; Computer vision
Online Access: https://hdl.handle.net/10356/178462
http://arxiv.org/abs/2309.14181v3
https://openreview.net/forum?id=0V5TVt9bk0
https://iclr.cc/
Institution: Nanyang Technological University
id
sg-ntu-dr.10356-178462
record_format
dspace
spelling
Record sg-ntu-dr.10356-178462 (last updated 2024-06-21T06:46:39Z)
Title: Q-bench: a benchmark for general-purpose foundation models on low-level vision
Authors: Wu, Haoning; Zhang, Zicheng; Zhang, Erli; Chen, Chaofeng; Liao, Liang; Wang, Annan; Li, Chunyi; Sun, Wenxiu; Yan, Qiong; Zhai, Guangtao; Lin, Weisi
Affiliations: College of Computing and Data Science; S-Lab
Conference: 12th International Conference on Learning Representations (ICLR 2024)
Subjects: Computer and Information Science; Multi-modality large language models; Computer vision
Type: Conference Paper
Dates: deposited 2024-06-20T07:50:01Z; published 2024
Citation: Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Li, C., Sun, W., Yan, Q., Zhai, G. & Lin, W. (2024). Q-bench: a benchmark for general-purpose foundation models on low-level vision. 12th International Conference on Learning Representations (ICLR 2024). https://hdl.handle.net/10356/178462
Related links: http://arxiv.org/abs/2309.14181v3; https://openreview.net/forum?id=0V5TVt9bk0; https://iclr.cc/
Language: en
DOI: 10.21979/N9/M41ERD
Rights: © 2024 ICLR. All rights reserved.
Collection: DR-NTU (NTU Library)
description
The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, there is still an inadequacy in assessing the
abilities of MLLMs on low-level visual perception and understanding. To address
this gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate potential abilities of MLLMs on three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs on answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset consisting of long expert-labelled golden low-level text
descriptions on 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure their visual quality assessment ability to align with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess preliminary low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements on MLLMs towards these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance these untapped potentials of MLLMs. Project Page:
https://q-future.github.io/Q-Bench.
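
To make the LLDescribe comparison step concrete: the abstract describes a GPT-involved pipeline that compares MLLM outputs against expert-labelled golden descriptions. The sketch below is a rough, hypothetical illustration of such a GPT-as-judge comparison; the call_gpt callable, the prompt wording, and the 0–2 rubric over completeness, preciseness, and relevance are assumptions for illustration, not the exact pipeline used in the paper.

```python
# A rough, hypothetical sketch of a GPT-as-judge comparison between an MLLM's
# low-level description and an expert-labelled golden description. The
# call_gpt() callable, the prompt wording, and the 0-2 rubric are assumptions
# made for illustration, not the paper's exact pipeline.

JUDGE_PROMPT = """You are given a golden low-level description of an image and a
candidate description produced by a model.

Golden description:
{golden}

Candidate description:
{candidate}

Rate the candidate against the golden description on completeness, preciseness,
and relevance, each as an integer from 0 to 2, and reply exactly in the form:
completeness=<n>, preciseness=<n>, relevance=<n>"""


def judge_description(golden: str, candidate: str, call_gpt) -> dict:
    """Ask a GPT judge to score a candidate description; returns the three scores."""
    reply = call_gpt(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
    # Assumes the judge follows the requested "key=value, key=value" reply format.
    scores = {}
    for part in reply.split(","):
        key, _, value = part.strip().partition("=")
        scores[key] = int(value)
    return scores
```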
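
The softmax-based scoring strategy mentioned in the abstract maps an MLLM's preference between opposing quality words to a continuous score. The minimal sketch below assumes the model exposes next-token logits for a quality-rating prompt and that the logits of the tokens "good" and "poor" have already been extracted; the prompt and the choice of those two tokens are illustrative assumptions in the spirit of the strategy described above.

```python
# A minimal sketch of a softmax-based quality-scoring strategy, assuming the
# MLLM exposes next-token logits for a prompt such as "Rate the quality of the
# image." Extracting the logits of the "good" and "poor" tokens is left to the
# caller; the token choice and prompt are illustrative assumptions.
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map the logits of the 'good' and 'poor' tokens to a score in [0, 1]."""
    # Two-class softmax: p(good) / (p(good) + p(poor)).
    m = max(logit_good, logit_poor)      # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


# Example with made-up logits: a higher "good" logit yields a higher score.
print(softmax_quality_score(logit_good=2.3, logit_poor=0.4))  # ~0.87
```

Scores obtained this way can then be compared with human mean opinion scores on existing IQA datasets, typically via rank and linear correlation coefficients.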