Q-bench: a benchmark for general-purpose foundation models on low-level vision
The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding…
Saved in:
Main Authors: Wu, Haoning; Zhang, Zicheng; Zhang, Erli; Chen, Chaofeng; Liao, Liang; Wang, Annan; Li, Chunyi; Sun, Wenxiu; Yan, Qiong; Zhai, Guangtao; Lin, Weisi
Other Authors: College of Computing and Data Science
Format: Conference or Workshop Item
Language: English
Published: 2024
Subjects: Computer and Information Science; Multi-modality large language models; Computer vision
Online Access: https://hdl.handle.net/10356/178462
http://arxiv.org/abs/2309.14181v3
https://openreview.net/forum?id=0V5TVt9bk0
https://iclr.cc/
Institution: Nanyang Technological University
id
sg-ntu-dr.10356-178462
record_format
dspace
spelling
Record sg-ntu-dr.10356-178462 (last updated 2024-06-21T06:46:39Z)
Title: Q-bench: a benchmark for general-purpose foundation models on low-level vision
Authors: Wu, Haoning; Zhang, Zicheng; Zhang, Erli; Chen, Chaofeng; Liao, Liang; Wang, Annan; Li, Chunyi; Sun, Wenxiu; Yan, Qiong; Zhai, Guangtao; Lin, Weisi
Affiliations: College of Computing and Data Science; S-Lab
Conference: 12th International Conference on Learning Representations (ICLR 2024)
Subjects: Computer and Information Science; Multi-modality large language models; Computer vision
Type: Conference Paper
Dates: deposited 2024-06-20T07:50:01Z; published 2024
Citation: Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Li, C., Sun, W., Yan, Q., Zhai, G. & Lin, W. (2024). Q-bench: a benchmark for general-purpose foundation models on low-level vision. 12th International Conference on Learning Representations (ICLR 2024). https://hdl.handle.net/10356/178462
Related links: http://arxiv.org/abs/2309.14181v3; https://openreview.net/forum?id=0V5TVt9bk0; https://iclr.cc/
Language: en
DOI: 10.21979/N9/M41ERD
Rights: © 2024 ICLR. All rights reserved.
Collection: DR-NTU (NTU Library)
description
The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, there is still an inadequacy in assessing the
abilities of MLLMs on low-level visual perception and understanding. To address
this gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate potential abilities of MLLMs on three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs on answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset consisting of long expert-labelled golden low-level text
descriptions on 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure their visual quality assessment ability to align with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess preliminary low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements on MLLMs towards these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance these untapped potentials of MLLMs. Project Page:
https://q-future.github.io/Q-Bench.
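
To make the LLDescribe comparison step concrete: the abstract describes a GPT-involved pipeline that compares MLLM outputs against expert-labelled golden descriptions. The sketch below is a rough, hypothetical illustration of such a GPT-as-judge comparison; the call_gpt callable, the prompt wording, and the 0–2 rubric over completeness, preciseness, and relevance are assumptions for illustration, not the exact pipeline used in the paper.

```python
# A rough, hypothetical sketch of a GPT-as-judge comparison between an MLLM's
# low-level description and an expert-labelled golden description. The
# call_gpt() callable, the prompt wording, and the 0-2 rubric are assumptions
# made for illustration, not the paper's exact pipeline.

JUDGE_PROMPT = """You are given a golden low-level description of an image and a
candidate description produced by a model.

Golden description:
{golden}

Candidate description:
{candidate}

Rate the candidate against the golden description on completeness, preciseness,
and relevance, each as an integer from 0 to 2, and reply exactly in the form:
completeness=<n>, preciseness=<n>, relevance=<n>"""


def judge_description(golden: str, candidate: str, call_gpt) -> dict:
    """Ask a GPT judge to score a candidate description; returns the three scores."""
    reply = call_gpt(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
    # Assumes the judge follows the requested "key=value, key=value" reply format.
    scores = {}
    for part in reply.split(","):
        key, _, value = part.strip().partition("=")
        scores[key] = int(value)
    return scores
```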
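
The softmax-based scoring strategy mentioned in the abstract maps an MLLM's preference between opposing quality words to a continuous score. The minimal sketch below assumes the model exposes next-token logits for a quality-rating prompt and that the logits of the tokens "good" and "poor" have already been extracted; the prompt and the choice of those two tokens are illustrative assumptions in the spirit of the strategy described above.

```python
# A minimal sketch of a softmax-based quality-scoring strategy, assuming the
# MLLM exposes next-token logits for a prompt such as "Rate the quality of the
# image." Extracting the logits of the "good" and "poor" tokens is left to the
# caller; the token choice and prompt are illustrative assumptions.
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map the logits of the 'good' and 'poor' tokens to a score in [0, 1]."""
    # Two-class softmax: p(good) / (p(good) + p(poor)).
    m = max(logit_good, logit_poor)      # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


# Example with made-up logits: a higher "good" logit yields a higher score.
print(softmax_quality_score(logit_good=2.3, logit_poor=0.4))  # ~0.87
```

Scores obtained this way can then be compared with human mean opinion scores on existing IQA datasets, typically via rank and linear correlation coefficients.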