Q-align: teaching LMMs for visual scoring via discrete text-defined levels

The explosion of visual content available online underscores the need for an accurate machine assessor that can robustly score diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) on a wide rang...

Bibliographic Details
Main Authors:	Wu, Haoning; Zhang, Zicheng; Zhang, Weixia; Chen, Chaofeng; Liao, Liang; Li, Chunyi; Gao, Yixuan; Wang, Annan; Zhang, Erli; Sun, Wenxiu; Yan, Qiong; Min, Xiongkuo; Zhai, Guangtao; Lin, Weisi
Other Authors:	College of Computing and Data Science
Format:	Conference or Workshop Item
Language:	English
Published:	2024
Subjects:	Computer and Information Science; Large multi-modality models; Computer vision
Online Access:	https://hdl.handle.net/10356/178466 http://arxiv.org/abs/2312.17090v1 https://openreview.net/forum?id=PHjkVjR78A https://icml.cc/
Institution:	Nanyang Technological University

Similar Items

Q-bench: a benchmark for general-purpose foundation models on low-level vision
by: Wu, Haoning, et al.
Published: (2024)

Q-instruct: improving low-level visual abilities for multi-modality foundation models
by: Wu, Haoning, et al.
Published: (2024)

Exploring video quality assessment on user generated contents from aesthetic and technical perspectives
by: Wu, Haoning, et al.
Published: (2024)

FAST-VQA: efficient end-to-end video quality assessment with fragment sampling
by: Wu, Haoning, et al.
Published: (2024)

Evaluation of modal stress resultants in freely vibrating plates
by: Wang, C.M., et al.
Published: (2014)

Collaborative cross-modal fusion with Large Language Model for recommendation
by: LIU, Zhongzhou, et al.
Published: (2024)

Exploring the effectiveness of video perceptual representation in blind video quality assessment
by: Liao, Liang, et al.
Published: (2024)

Neighbourhood representative sampling for efficient end-to-end video quality assessment
by: Wu, Haoning, et al.
Published: (2024)

Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection
by: Gao, Wei, et al.
Published: (2021)

汉语情态助动词的主观性和主观化 = THE SUBJECTIVITY AND SUBJECTIFICATION OF MODAL AUXILIARIES IN CHINESE
by: 杨黎黎, et al.
Published: (2015)

AimigoTutor - tutoring application using multi-modal capabilities
by: Nguyen, Viet Hoang
Published: (2024)

Blind video quality prediction by uncovering human video perceptual representation
by: Liao, Liang, et al.
Published: (2024)

A psychovisual quality metric in free-energy principle
by: Lin, Weisi, et al.
Published: (2013)

Cross-modal recipe retrieval with stacked attention model
by: CHEN, Jing-Jing, et al.
Published: (2018)

NVP-HRI: zero shot natural voice and posture-based human–robot interaction via large language model
by: Lai, Yuzhi, et al.
Published: (2025)

Epistemic modality in TED talks on education
by: Ton Nu, My Nhat, et al.
Published: (2019)

Analytic learning in multi-modal continual test-time adaptation
by: Zhang, Yufei
Published: (2025)

Retrieval augmented recipe generation
by: LIU, Guoshan, et al.
Published: (2025)

Unifying text, tables, and images for multimodal question answering
by: LUO, Haohao, et al.
Published: (2023)

Alleviating the inconsistency of multimodal data in cross-modal retrieval
by: Li, Tieying, et al.
Published: (2024)

Temporal sentence grounding in videos: a survey and future directions
by: Zhang, Hao, et al.
Published: (2023)

Inference acceleration of large language models
by: Zhang, Boyu
Published: (2024)

Vision-language-model-based video quality assessment
by: Zhang, Erli
Published: (2024)

Cross-modal recipe retrieval: How to cook this dish?
by: CHEN, Jingjing, et al.
Published: (2017)

Learning language to symbol and language to vision mapping for visual grounding
by: He, Su, et al.
Published: (2022)

Can online reviews reveal a product's true quality? Empirical findings and analytical modeling of online word-of-mouth communication
by: HU, Nan, et al.
Published: (2006)

Modalities and Multimodalities
by: Carnielli, Walter, et al.
Published: (2017)

A characterisation of open bisimilarity using an intuitionistic modal logic
by: Ahn, Ki Yung, et al.
Published: (2018)

Geographic mapping with unsupervised multi-modal representation learning from VHR images and POIs
by: Bai, Lubin, et al.
Published: (2023)

Fusing heterogeneous modalities for video and image re-ranking
by: TAN, Hung-Khoon, et al.
Published: (2011)

QuantfolioX: portfolio management application using large language model technology
by: Teo, Charlotte Xuan Qin
Published: (2024)

The verb in Philippine English: A preliminary analysis of modal would
by: Bautista, Ma. Lourdes S.
Published: (2004)

FHENet: lightweight feature hierarchical exploration network for real-time rail surface defect inspection in RGB-D images
by: Zhou, Wujie, et al.
Published: (2023)

Online multi-face tracking with multi-modality cascaded matching
by: Weng, Zhenyu, et al.
Published: (2024)

An empirical study on adaptation methods for large-scale vision-language models
by: Wang, Annan
Published: (2023)

Equilibrium characterization and incentives in large games
by: ZHANG LUYI
Published: (2010)

A stretchable and transparent electrode based on PEGylated silk fibroin for in vivo dual-modal neural-vascular activity probing
by: Cui, Yajing, et al.
Published: (2022)

Don’t just say “I don’t know”! Self-aligning Large Language Models for responding to unknown questions with explanations
by: DENG, Yang, et al.
Published: (2024)

Lightweight salient object detection in optical remote-sensing images via semantic matching and edge alignment
by: Li, Gongyang, et al.
Published: (2023)

LKAW: a robust watermarking method based on large kernel convolution and adaptive weight assignment
by: Zhang, Xiaorui, et al.
Published: (2023)