Q-bench: a benchmark for general-purpose foundation models on low-level vision

The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understandin...

Bibliographic Details
Main Authors: Wu, Haoning, Zhang, Zicheng, Zhang, Erli, Chen, Chaofeng, Liao, Liang, Wang, Annan, Li, Chunyi, Sun, Wenxiu, Yan, Qiong, Zhai, Guangtao, Lin, Weisi
Other Authors: College of Computing and Data Science
Format: Conference or Workshop Item
Language: English
Published: 2024
Subjects: Computer and Information Science
Multi-modality large language models
Computer vision
Online Access: https://hdl.handle.net/10356/178462
http://arxiv.org/abs/2309.14181v3
https://openreview.net/forum?id=0V5TVt9bk0
https://iclr.cc/
Institution: Nanyang Technological University
id sg-ntu-dr.10356-178462
record_format dspace
spelling sg-ntu-dr.10356-178462
2024-06-21T06:46:39Z
12th International Conference on Learning Representations (ICLR 2024)
S-Lab
2024-06-20T07:50:01Z
2024-06-20T07:50:01Z
2024
Conference Paper
Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Li, C., Sun, W., Yan, Q., Zhai, G. & Lin, W. (2024). Q-bench: a benchmark for general-purpose foundation models on low-level vision. 12th International Conference on Learning Representations (ICLR 2024). https://hdl.handle.net/10356/178462
en
10.21979/N9/M41ERD
© 2024 ICLR. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Multi-modality large language models
Computer vision
description The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.
author2 College of Computing and Data Science
format Conference or Workshop Item
author Wu, Haoning
Zhang, Zicheng
Zhang, Erli
Chen, Chaofeng
Liao, Liang
Wang, Annan
Li, Chunyi
Sun, Wenxiu
Yan, Qiong
Zhai, Guangtao
Lin, Weisi
author_sort Wu, Haoning
title Q-bench: a benchmark for general-purpose foundation models on low-level vision
publishDate 2024
url https://hdl.handle.net/10356/178462
http://arxiv.org/abs/2309.14181v3
https://openreview.net/forum?id=0V5TVt9bk0
https://iclr.cc/
_version_ 1806059908397793280