VisualCritic: Making LMMs Perceive Visual Quality Like Humans
- URL: http://arxiv.org/abs/2403.12806v1
- Date: Tue, 19 Mar 2024 15:07:08 GMT
- Title: VisualCritic: Making LMMs Perceive Visual Quality Like Humans
- Authors: Zhipeng Huang, Zhizheng Zhang, Yiting Lu, Zheng-Jun Zha, Zhibo Chen, Baining Guo
- Abstract summary: We present VisualCritic, the first LMM for broad-spectrum image subjective quality assessment.
VisualCritic can be used across diverse data right out of the box, without any dataset-specific adaptation.
- Score: 65.59779450136399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: At present, large multimodal models (LMMs) have exhibited impressive generalization capabilities in understanding and generating visual signals. However, they still lack sufficient capability to perceive low-level visual quality akin to human perception. Can LMMs achieve this and show the same degree of generalization in this regard? If so, not only could the versatility of LMMs be further enhanced, but the challenge of poor cross-dataset performance in visual quality assessment could also be addressed. In this paper, we explore this question and provide the answer "Yes!". As a result of this initial exploration, we present VisualCritic, the first LMM for broad-spectrum image subjective quality assessment. VisualCritic can be used across diverse data right out of the box, without the dataset-specific adaptation that conventional specialist models require. As an instruction-following LMM, VisualCritic enables new capabilities of (1) quantitatively measuring the perceptual quality of given images in terms of their Mean Opinion Score (MOS), noisiness, colorfulness, sharpness, and other numerical indicators, (2) qualitatively evaluating visual quality and providing explainable descriptions, and (3) discerning whether a given image is AI-generated or photographic. Extensive experiments demonstrate the efficacy of VisualCritic by comparing it with other open-source LMMs and conventional specialist models over both AI-generated and photographic images.
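Since VisualCritic is an instruction-following LMM, the three capabilities above amount to pairing an image with a natural-language instruction and parsing the reply. The sketch below is purely illustrative: the query_visualcritic helper, the prompt wording, and the JSON reply format are assumptions made for this example, not the interface described in the paper.

```python
import json

def query_visualcritic(image_path: str, instruction: str) -> str:
    """Placeholder for an image+text call to an instruction-following LMM.

    A real setup would load the VisualCritic checkpoint (or another LMM)
    and run inference here; the name and signature are assumed for this sketch.
    """
    raise NotImplementedError

def assess_image(image_path: str) -> dict:
    # Capability (1): request numerical indicators in a parseable format.
    scores = json.loads(query_visualcritic(
        image_path,
        "Rate this image and reply in JSON with 'mos' (1-5), "
        "'noisiness', 'colorfulness', and 'sharpness' (each 0-1)."))
    # Capability (2): request an explainable, qualitative description.
    scores["explanation"] = query_visualcritic(
        image_path, "Describe the perceptual quality of this image and explain why.")
    # Capability (3): ask whether the image is AI-generated or photographic.
    scores["source"] = query_visualcritic(
        image_path, "Is this image AI-generated or photographic? Answer in one word.")
    return scores
```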
Related papers
- Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models [31.47100708645748]
Recent developments in Large Multimodal Models (LMMs) have demonstrated impressive results on high-level visual tasks.
We propose a novel, challenging, and efficient benchmark based on composite blocks, called ComBo, which provides a disentangled evaluation framework.
We find that although LMMs exhibit acceptable generalization ability in learning new categories, there are still gaps compared to humans in many ways.
arXiv Detail & Related papers (2024-09-03T02:55:36Z)
- Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z)
- A-Bench: Are LMMs Masters at Evaluating AI-generated Images? [78.3699767628502]
A-Bench is a benchmark designed to diagnose whether large multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs).
Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each accompanied by question-answer pairs annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z)
- Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch with prompt-learning techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- 2AFC Prompting of Large Multimodal Models for Image Quality Assessment [38.86162365208038]
Two-alternative forced choice (2AFC) prompting is widely regarded as the most reliable way of collecting human opinions of visual quality.
The global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori estimation (a sketch of this aggregation appears after this list).
arXiv Detail & Related papers (2024-02-02T06:05:18Z)
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores (a sketch of converting such levels back to a numeric score appears after this list).
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z)
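The 2AFC entry above mentions aggregating pairwise forced-choice outcomes into a global quality score per image via maximum a posteriori estimation. The sketch below illustrates one way such aggregation can be done, assuming a Bradley-Terry likelihood with a zero-mean Gaussian prior and plain gradient ascent; the probabilistic model and solver in the cited paper may differ.

```python
import numpy as np

def map_quality_scores(wins: np.ndarray, prior_var: float = 4.0,
                       lr: float = 0.05, iters: int = 2000) -> np.ndarray:
    """MAP estimate of latent quality scores from 2AFC outcomes.

    wins[i, j] counts how often image i was preferred over image j in
    2AFC prompts. A Bradley-Terry likelihood with a zero-mean Gaussian
    prior is assumed here for illustration only.
    """
    n = wins.shape[0]
    s = np.zeros(n)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(s[:, None] - s[None, :])))  # P(i beats j)
        grad = (wins * (1.0 - p)).sum(axis=1)   # wins as i push s_i up
        grad -= (wins.T * p).sum(axis=1)        # losses to others push s_i down
        grad -= s / prior_var                   # Gaussian prior (the MAP term)
        s += lr * grad
    return s - s.mean()  # scores are identifiable only up to a constant shift

# Example: 3 images, image 0 usually preferred over images 1 and 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print(map_quality_scores(wins))
```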
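The Q-Align entry scores images with discrete text-defined levels rather than raw numbers. One natural way to recover a continuous score from such levels is a probability-weighted average over the level tokens; the snippet below sketches this conversion with the five standard ITU rating levels and hypothetical token probabilities, without claiming to reproduce Q-Align's exact procedure.

```python
# Map the five text-defined rating levels to numeric values (ITU-style 1-5 scale).
LEVELS = {"bad": 1.0, "poor": 2.0, "fair": 3.0, "good": 4.0, "excellent": 5.0}

def level_probs_to_score(level_probs: dict[str, float]) -> float:
    """Convert an LMM's probabilities over level tokens into a continuous score.

    `level_probs` is assumed to hold the model's (softmax) probabilities for
    the five level words; the weighted average recovers a numeric rating.
    """
    total = sum(level_probs.values())
    return sum(LEVELS[w] * p for w, p in level_probs.items()) / total

# Hypothetical output probabilities for one image:
print(level_probs_to_score({"bad": 0.02, "poor": 0.08, "fair": 0.25,
                            "good": 0.45, "excellent": 0.20}))  # ~3.73
```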