Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Main Authors:
Other Authors:
Format: Conference or Workshop Item
Language: English
Published: 2024
Subjects:
Online Access: https://hdl.handle.net/10356/178464
http://arxiv.org/abs/2311.06783v1
https://openaccess.thecvf.com/content/CVPR2024/papers/Wu_Q-Instruct_Improving_Low-level_Visual_Abilities_for_Multi-modality_Foundation_Models_CVPR_2024_paper.pdf
Institution: Nanyang Technological University
Summary: Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm for low-level visual perception and understanding tasks, enabling a single model to respond to a broad range of natural human instructions. While existing foundation models have shown exciting potential on low-level visual tasks, their related abilities are still preliminary and need to be improved. To enhance these models, we conduct a large-scale subjective experiment collecting a vast amount of real human feedback on low-level vision. Each piece of feedback follows a pathway that starts with a detailed description of the low-level visual appearance (*e.g.*, clarity, color, brightness) of an image and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to respond robustly to diverse types of questions, we design a GPT-participated conversion that processes this feedback into 200K instruction-response pairs in diverse formats, termed **Q-Instruct**. Experimental results indicate that **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundation models. We anticipate that our datasets can pave the way for a future in which general intelligence can perceive and understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo are published at: https://q-future.github.io/Q-Instruct