Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment
- URL: http://arxiv.org/abs/2512.09555v1
- Date: Wed, 10 Dec 2025 11:50:42 GMT
- Title: Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment
- Authors: Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida
- Abstract summary: We analyze the factors that cause contradictory assessments and instability. We introduce a two-stage tuning method that explicitly separates visual perception from quality inference.
- Score: 7.969076042774561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
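The abstract reports gains in SRCC and PLCC, the Spearman rank and Pearson linear correlation coefficients between predicted scores and human opinion scores. The sketch below shows how these two metrics can be computed in plain Python. It is an illustration only, not the authors' evaluation code: the function names and sample values are hypothetical, and it omits the nonlinear mapping that IQA evaluations often fit before computing PLCC.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted scores and mean opinion scores (not from the paper).
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
mos = [1.0, 4.0, 9.0, 16.0, 25.0]
srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
```

In this toy example the predictions rank the images in the same order as the opinion scores, so SRCC reaches its maximum, while PLCC stays below 1 because the relationship is monotonic but not linear.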
Related papers
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z) - Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment [25.916354359994624]
We propose Q-Hawkeye, an RL-based reliable visual policy optimization framework. Q-Hawkeye estimates predictive uncertainty using the variance of predicted scores across multiple rollouts. We introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence.
arXiv Detail & Related papers (2026-01-30T12:42:32Z) - Understanding Pure Textual Reasoning for Blind Image Quality Assessment [4.971551895830219]
Textual reasoning has been widely adopted in Blind Image Quality Assessment (BIQA). It remains unclear how textual information contributes to quality prediction and to what extent text can represent the score-related image contents.
arXiv Detail & Related papers (2026-01-05T11:43:56Z) - Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Interpreting Predictive Probabilities: Model Confidence or Human Label Variation? [27.226997687210044]
We identify two main perspectives that drive starkly different evaluation protocols.
We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems.
We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.
arXiv Detail & Related papers (2024-02-25T15:00:13Z) - DifFIQA: Face Image Quality Assessment Using Denoising Diffusion Probabilistic Models [1.217503190366097]
Face image quality assessment (FIQA) techniques aim to mitigate these performance degradations.
We present a powerful new FIQA approach, named DifFIQA, which relies on denoising diffusion probabilistic models (DDPM).
Because the diffusion-based perturbations are computationally expensive, we also distill the knowledge encoded in DifFIQA into a regression-based quality predictor, called DifFIQA(R).
arXiv Detail & Related papers (2023-05-09T21:03:13Z) - Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models, supporting several datasets.
We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z) - Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent.
Processing correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z) - Task-Specific Normalization for Continual Learning of Blind Image Quality Models [105.03239956378465]
We present a simple yet effective continual learning method for blind image quality assessment (BIQA).
The key step in our approach is to freeze all convolution filters of a pre-trained deep neural network (DNN) for an explicit promise of stability.
We assign each new IQA dataset (i.e., task) a prediction head, and load the corresponding normalization parameters to produce a quality score.
The final quality estimate is computed by a weighted summation of predictions from all heads with a lightweight $K$-means gating mechanism.
arXiv Detail & Related papers (2021-07-28T15:21:01Z) - Uncertainty-Aware Blind Image Quality Assessment in the Laboratory and Wild [98.48284827503409]
We develop a unified BIQA model and an approach of training it for both synthetic and realistic distortions.
We employ the fidelity loss to optimize a deep neural network for BIQA over a large number of such image pairs.
Experiments on six IQA databases show the promise of the learned method in blindly assessing image quality in the laboratory and wild.
arXiv Detail & Related papers (2020-05-28T13:35:23Z)