Revisiting Vision Language Foundations for No-Reference Image Quality Assessment
- URL: http://arxiv.org/abs/2509.17374v1
- Date: Mon, 22 Sep 2025 06:24:42 GMT
- Title: Revisiting Vision Language Foundations for No-Reference Image Quality Assessment
- Authors: Ankit Yadav, Ta Duc Huy, Lingqiao Liu
- Abstract summary: Large-scale vision language pre-training has recently shown promise for no-reference image quality assessment (NR-IQA). We present the first systematic evaluation of six prominent pretrained backbones (CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet) for the task of No-Reference Image Quality Assessment (NR-IQA). Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for enhancing the generalization ability of image quality assessment models.
- Score: 31.550239698285058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale vision language pre-training has recently shown promise for no-reference image quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained backbones (CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet) for the task of No-Reference Image Quality Assessment (NR-IQA), each finetuned using an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for enhancing the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform the commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design and achieving new state-of-the-art SRCC on CLIVE, KADID10K, and AGIQA3K. Extensive ablations confirm the benefits across architectures and regimes, establishing strong, resource-efficient NR-IQA baselines.
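The learnable activation selection the abstract describes can be sketched as a per-channel softmax-weighted mixture over a set of candidate nonlinearities. This is a minimal illustration, not the paper's implementation: the class name, the initialization scale, and the candidate set {sigmoid, ReLU, GELU} are assumptions for the sketch.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Candidate nonlinearities the mechanism can choose between (illustrative set)
CANDIDATES = [
    lambda x: 1.0 / (1.0 + np.exp(-x)),  # sigmoid
    lambda x: np.maximum(x, 0.0),        # ReLU
    gelu,                                # GELU
]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class LearnableActivation:
    """Per-channel convex mixture over candidate nonlinearities.

    `logits[c]` holds one logit per candidate for channel c; in training
    these would be optimized jointly with the MLP head, so each channel
    adaptively settles on its own activation."""

    def __init__(self, num_channels, num_candidates=3, seed=0):
        rng = np.random.default_rng(seed)
        # near-zero init: start close to a uniform mixture
        self.logits = rng.normal(0.0, 0.01, size=(num_channels, num_candidates))

    def __call__(self, x):
        # x: (batch, channels)
        out = np.zeros_like(x)
        for c in range(x.shape[1]):
            w = softmax(self.logits[c])
            out[:, c] = sum(wi * f(x[:, c]) for wi, f in zip(w, CANDIDATES))
        return out
```

With near-zero logits the mixture starts roughly uniform; gradient updates to `logits` would then push each channel toward whichever nonlinearity helps generalization, removing the manual activation choice.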
Related papers
- Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling [53.74410422225995]
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking their unequal contributions to quality prediction. This paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced layer interaction and MoE-based feature decoupling, termed Life-IQA.
arXiv Detail & Related papers (2025-11-24T11:59:55Z) - Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios. We introduce Continual AQA (CAQA), which equips AQA models with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z) - Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment [22.184690568393126]
Reinforcement fine-tuning (RFT) is an increasingly popular paradigm for LMM training. We propose a multi-stage RFT IQA framework (Refine-IQA). The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks.
arXiv Detail & Related papers (2025-08-04T22:46:10Z) - EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment [68.77813885751308]
EyeSimVQA is a novel VQA framework that incorporates free-energy-based self-repair. We show EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-06-13T08:00:54Z) - VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank [23.613534906344753]
We introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model. We train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. In experiments, VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models.
arXiv Detail & Related papers (2025-05-20T14:56:50Z) - Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [27.26829134776367]
Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. We propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO). We show that Q-Insight substantially outperforms existing state-of-the-art methods in both score regression and degradation perception tasks.
arXiv Detail & Related papers (2025-03-28T17:59:54Z) - IQPFR: An Image Quality Prior for Blind Face Restoration and Beyond [56.99331967165238]
Blind Face Restoration (BFR) addresses the challenge of reconstructing degraded low-quality (LQ) facial images into high-quality (HQ) outputs. We propose a novel framework that incorporates an Image Quality Prior (IQP) derived from No-Reference Image Quality Assessment (NR-IQA) models. Our method outperforms state-of-the-art techniques across multiple benchmarks.
arXiv Detail & Related papers (2025-03-12T11:39:51Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Q-Boost: On Visual Quality Assessment Ability of Low-level
Multi-Modality Foundation Models [80.79438689784958]
We introduce Q-Boost, a strategy designed to enhance low-level MLLMs in image quality assessment (IQA) and video quality assessment (VQA) tasks.
Q-Boost innovates by incorporating a "middle ground" approach through neutral prompts, allowing for a more balanced and detailed assessment.
The experimental results show that the low-level MLLMs exhibit outstanding zero-shot performance on IQA/VQA tasks when equipped with the Q-Boost strategy.
arXiv Detail & Related papers (2023-12-23T17:02:25Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z) - No-Reference Image Quality Assessment via Feature Fusion and Multi-Task
Learning [29.19484863898778]
Blind or no-reference image quality assessment (NR-IQA) is a fundamental, unsolved, and yet challenging problem.
We propose a simple and yet effective general-purpose no-reference (NR) image quality assessment framework based on multi-task learning.
Our model employs distortion types as well as subjective human scores to predict image quality.
arXiv Detail & Related papers (2020-06-06T05:04:10Z)
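The multi-task NR-IQA setup that last entry describes, predicting a quality score alongside the distortion type, can be sketched with two linear heads on shared backbone features. The head shapes, the loss weighting `alpha`, and the function names below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def multi_task_forward(features, w_score, w_dist):
    """Two heads on shared features: a scalar quality-score regressor
    and a distortion-type classifier (softmax over K types)."""
    score = features @ w_score                        # (batch,)
    logits = features @ w_dist                        # (batch, K)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return score, probs

def joint_loss(score, probs, mos, dist_labels, alpha=0.5):
    """Weighted sum of score regression (MSE against mean opinion scores)
    and distortion classification (cross-entropy)."""
    mse = np.mean((score - mos) ** 2)
    ce = -np.mean(np.log(probs[np.arange(len(dist_labels)), dist_labels] + 1e-12))
    return alpha * mse + (1.0 - alpha) * ce
```

Training both heads against a joint loss of this shape is what lets the distortion-type labels act as an auxiliary signal for the quality prediction.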
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.