Q-align: teaching LMMs for visual scoring via discrete text-defined levels
The explosion of visual content available online underscores the need for an accurate machine assessor to robustly evaluate scores across diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) on a wide rang...
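Main Authors: | Wu, Haoning; Zhang, Zicheng; Zhang, Weixia; Chen, Chaofeng; Liao, Liang; Li, Chunyi; Gao, Yixuan; Wang, Annan; Zhang, Erli; Sun, Wenxiu; Yan, Qiong; Min, Xiongkuo; Zhai, Guangtao; Lin, Weisi |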
---|---|
Other Authors: | College of Computing and Data Science |
Format: | Conference or Workshop Item |
Language: | English |
Published: | 2024 |
Subjects: | Computer and Information Science; Large multi-modality models; Computer vision |
Online Access: | https://hdl.handle.net/10356/178466 http://arxiv.org/abs/2312.17090v1 https://openreview.net/forum?id=PHjkVjR78A https://icml.cc/ |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-178466 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-178466
2024-07-01T01:20:57Z
Q-align: teaching LMMs for visual scoring via discrete text-defined levels
Wu, Haoning; Zhang, Zicheng; Zhang, Weixia; Chen, Chaofeng; Liao, Liang; Li, Chunyi; Gao, Yixuan; Wang, Annan; Zhang, Erli; Sun, Wenxiu; Yan, Qiong; Min, Xiongkuo; Zhai, Guangtao; Lin, Weisi
College of Computing and Data Science
41st International Conference on Machine Learning (ICML 2024)
S-Lab
Computer and Information Science; Large multi-modality models; Computer vision
The explosion of visual content available online underscores the need for an accurate machine assessor to robustly evaluate scores across diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) in a wide range of related fields, in this work we explore how to teach them to rate visual content in alignment with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
2024-07-01T01:14:42Z
2024-07-01T01:14:42Z
2024
Conference Paper
Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., Yan, Q., Min, X., Zhai, G. & Lin, W. (2024). Q-align: teaching LMMs for visual scoring via discrete text-defined levels. 41st International Conference on Machine Learning (ICML 2024).
2640-3498
https://hdl.handle.net/10356/178466 http://arxiv.org/abs/2312.17090v1 https://openreview.net/forum?id=PHjkVjR78A https://icml.cc/
PMLR 235
en
© The Author(s). Published by ICML. All rights reserved. |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Computer and Information Science; Large multi-modality models; Computer vision |
description |
The explosion of visual content available online underscores the need
for an accurate machine assessor to robustly evaluate scores across diverse
types of visual content. While recent studies have demonstrated the
exceptional potential of large multi-modality models (LMMs) in a wide range of
related fields, in this work we explore how to teach them to rate visual
content in alignment with human opinions. Observing that human raters only
learn and judge discrete text-defined levels in subjective studies, we propose
to emulate this subjective process and teach LMMs with text-defined rating
levels instead of scores. The proposed Q-Align achieves state-of-the-art
performance on image quality assessment (IQA), image aesthetic assessment
(IAA), and video quality assessment (VQA) tasks under the original LMM
structure. With the syllabus, we further unify the three tasks into one model,
termed OneAlign. In our experiments, we demonstrate the advantage of the
discrete-level-based syllabus over direct-score-based variants for LMMs. Our
code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align. |
author2 |
College of Computing and Data Science |
author_facet |
College of Computing and Data Science; Wu, Haoning; Zhang, Zicheng; Zhang, Weixia; Chen, Chaofeng; Liao, Liang; Li, Chunyi; Gao, Yixuan; Wang, Annan; Zhang, Erli; Sun, Wenxiu; Yan, Qiong; Min, Xiongkuo; Zhai, Guangtao; Lin, Weisi |
format |
Conference or Workshop Item |
author |
Wu, Haoning; Zhang, Zicheng; Zhang, Weixia; Chen, Chaofeng; Liao, Liang; Li, Chunyi; Gao, Yixuan; Wang, Annan; Zhang, Erli; Sun, Wenxiu; Yan, Qiong; Min, Xiongkuo; Zhai, Guangtao; Lin, Weisi |
author_sort |
Wu, Haoning |
title |
Q-align: teaching LMMs for visual scoring via discrete text-defined levels |
title_sort |
q-align: teaching lmms for visual scoring via discrete text-defined levels |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/178466 http://arxiv.org/abs/2312.17090v1 https://openreview.net/forum?id=PHjkVjR78A https://icml.cc/ |
_version_ |
1806059793622761472 |
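
The description above says the model is taught with text-defined rating levels instead of scores, but this record does not say how levels are derived from scores or how a score is read back out. The following is a minimal sketch of one plausible reading, not the authors' implementation. The five level names, the equal-width binning, the 1 to 5 level values, and both function names are assumptions made for illustration; the released code at https://github.com/Q-Future/Q-Align is the authoritative source.

```python
import math

# Assumed five-level rating scale, ordered from worst to best (illustrative).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, lo, hi):
    """Map a numeric opinion score in [lo, hi] to one of five equal-width levels.

    Hypothetical helper: equal-width bins are an assumption of this sketch.
    """
    if not lo <= score <= hi:
        raise ValueError("score is outside the [lo, hi] range")
    width = (hi - lo) / len(LEVELS)
    # Clamp so that score == hi falls into the top level.
    index = min(int((score - lo) / width), len(LEVELS) - 1)
    return LEVELS[index]


def level_logits_to_score(logits):
    """Turn per-level logits into a scalar score.

    Applies a softmax over the five levels, then takes the
    probability-weighted average of the assumed level values 1..5.
    """
    peak = max(logits[name] for name in LEVELS)
    exps = {name: math.exp(logits[name] - peak) for name in LEVELS}
    total = sum(exps.values())
    return sum((i + 1) * exps[name] / total for i, name in enumerate(LEVELS))
```

For example, `score_to_level(3.0, 1, 5)` falls in the middle bin and yields `"fair"`, and five equal logits yield the midpoint value on the assumed 1 to 5 scale. Training targets would then be level words rather than numbers, while evaluation can still compare a scalar against human opinion scores.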