Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined
Levels
- URL: http://arxiv.org/abs/2312.17090v1
- Date: Thu, 28 Dec 2023 16:10:25 GMT
- Title: Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined
Levels
- Authors: Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao,
Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan,
Xiongkuo Min, Guangtao Zhai, Weisi Lin
- Abstract summary: We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
- Score: 95.44077384918725
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The explosion of visual content available online underscores the need
for an accurate machine assessor that can robustly score diverse types of
visual content. While recent studies have demonstrated the exceptional
potential of large multi-modality models (LMMs) across a wide range of related
fields, in this work we explore how to teach them to perform visual rating
aligned with human opinions. Observing that human raters only learn and judge
discrete text-defined levels in subjective studies, we propose to emulate this
subjective process and teach LMMs with text-defined rating levels instead of
scores. The proposed Q-Align achieves state-of-the-art performance on image
quality assessment (IQA), image aesthetic assessment (IAA), as well as video
quality assessment (VQA) tasks under the original LMM structure. With this
syllabus, we further unify the three tasks into one model, termed OneAlign.
In our experiments, we demonstrate the advantage of the discrete-level-based
syllabus over direct-score-based variants for LMMs. Our code and the
pre-trained weights are released at https://github.com/Q-Future/Q-Align.
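To make the discrete-level idea concrete, below is a minimal, hypothetical sketch of how such a syllabus can be wired up: mean opinion scores (MOS) are binned into five text-defined levels for training, and at inference the probabilities of the level tokens are collapsed back into a scalar score by a weighted average. The level names, the 1-5 weights, and the equal-width binning are illustrative assumptions, not the released Q-Align implementation.

```python
import torch

# Hypothetical level-based scoring sketch (names and mapping are assumptions,
# not the released Q-Align code).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]
LEVEL_WEIGHTS = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed 1-5 weights


def mos_to_level(mos: float, lo: float, hi: float) -> str:
    """Bin a mean opinion score in [lo, hi] into one of five equal-width text levels."""
    idx = min(int((mos - lo) / (hi - lo) * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]


def logits_to_score(level_logits: torch.Tensor) -> float:
    """Collapse the LMM's logits over the five level tokens into a scalar score.

    `level_logits` is assumed to hold the logits of the tokens "bad" ... "excellent"
    (in that order) at the position where the rating word is generated.
    """
    probs = torch.softmax(level_logits, dim=-1)   # close-set softmax over the levels only
    return float((probs * LEVEL_WEIGHTS).sum())   # probability-weighted average in [1, 5]


if __name__ == "__main__":
    print(mos_to_level(72.0, lo=0.0, hi=100.0))                       # -> "good"
    print(logits_to_score(torch.tensor([0.1, 0.2, 0.5, 2.0, 1.0])))   # -> about 3.7
```

Under this scheme the LMM is only ever asked to produce a discrete rating word, so training reduces to standard next-token prediction, while the weighted average recovers a continuous score at inference, mirroring how human raters report discrete levels rather than raw numbers in subjective studies.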
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
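As a generic illustration of this recipe (a NIQE-style sketch of the opinion-unaware family, not necessarily the exact multi-scale model proposed in the paper): fit a multivariate Gaussian to deep features pooled from pristine images, then score a test image by the distance between its own feature statistics and that pristine model. The deep feature extractor is assumed to exist upstream and is not shown.

```python
import numpy as np

# Generic opinion-unaware sketch (NIQE-style; the paper's multi-scale model may
# differ): fit a multivariate Gaussian to pooled deep features of pristine
# images, then score a test image by its distance to that model.
# Deep feature extraction (e.g. from a pre-trained CNN/ViT) is assumed upstream.


def fit_pristine_model(pristine_feats: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """pristine_feats: (n_samples, d) pooled deep features from pristine images."""
    mu = pristine_feats.mean(axis=0)
    cov = np.cov(pristine_feats, rowvar=False)
    return mu, cov


def quality_distance(test_feats: np.ndarray, mu: np.ndarray, cov: np.ndarray) -> float:
    """Distance of one test image's feature statistics to the pristine model.

    test_feats: (n_patches, d) deep features of the test image; a lower distance
    means statistics closer to pristine images, i.e. better predicted quality.
    """
    test_mu = test_feats.mean(axis=0)
    test_cov = np.cov(test_feats, rowvar=False)
    diff = test_mu - mu
    pooled = (cov + test_cov) / 2.0               # pooled covariance, as in NIQE
    return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))
```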
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the mismatch between this backbone's pretraining objective and the IQA task using prompt techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
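To illustrate the follow-up-question idea, here is a toy sketch with a made-up taxonomy; the paper's label hierarchy and question templates may differ. Given the ground-truth category and a parent-to-children taxonomy, coarse-to-fine questions can be generated automatically by walking up the hierarchy and listing the siblings at each level.

```python
# Toy sketch of hierarchy-driven follow-up questions. The taxonomy, category
# names, and question template below are made up for illustration.
TAXONOMY = {
    "animal": ["dog", "cat", "bird"],
    "dog": ["beagle", "husky", "poodle"],
}


def parent_of(label: str) -> str | None:
    """Return the parent category of `label` in the toy taxonomy, if any."""
    for parent, children in TAXONOMY.items():
        if label in children:
            return parent
    return None


def follow_up_questions(ground_truth: str) -> list[str]:
    """Walk up the hierarchy and ask coarse-to-fine questions about the category."""
    questions = []
    parent = parent_of(ground_truth)
    while parent is not None:
        options = ", ".join(TAXONOMY[parent])
        questions.append(f"Which kind of {parent} is shown: {options}?")
        parent = parent_of(parent)
    return list(reversed(questions))  # coarsest question first, finest last


print(follow_up_questions("beagle"))
# -> ['Which kind of animal is shown: dog, cat, bird?',
#     'Which kind of dog is shown: beagle, husky, poodle?']
```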
arXiv Detail & Related papers (2024-02-11T18:26:18Z)
- Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.