iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA
- URL: http://arxiv.org/abs/2510.17332v1
- Date: Mon, 20 Oct 2025 09:26:12 GMT
- Title: iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA
- Authors: Zhaoran Zhao, Xinli Yue, Jianhui Sun, Yuhao Xie, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng
- Abstract summary: iDETEX is a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks.
- Score: 10.857047397246598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX, a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.
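The abstract names a data mixing strategy over the three heterogeneous subtasks but does not spell it out. Below is a minimal sketch of one common recipe, temperature-scaled sampling across per-task pools; all pool names, sizes, and the temperature value are hypothetical, not taken from the paper.

```python
import random

# Hypothetical training pools for the three iDETEX subtasks; names and
# sizes are invented for illustration.
POOLS = {
    "grounding":   [f"grounding_{i}" for i in range(8000)],
    "perception":  [f"perception_{i}" for i in range(3000)],
    "description": [f"description_{i}" for i in range(1000)],
}

def mixing_weights(pools, temperature=0.5):
    """Temperature-scaled task weights: t=1 follows pool sizes,
    t -> 0 approaches uniform sampling across tasks."""
    scaled = {task: len(data) ** temperature for task, data in pools.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

def sample_batch(pools, batch_size=8, temperature=0.5):
    """Pick a task per slot by weight, then an example from that task's pool."""
    weights = mixing_weights(pools, temperature)
    tasks = random.choices(list(pools), weights=list(weights.values()), k=batch_size)
    return [(task, random.choice(pools[task])) for task in tasks]

if __name__ == "__main__":
    print(mixing_weights(POOLS))  # e.g. {'grounding': 0.51, ...} at t=0.5
    for task, example in sample_batch(POOLS):
        print(task, example)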
Related papers
- AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment [69.06977852423564]
Image quality assessment (IQA) reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. AgenticIQA decomposes IQA into four subtasks: distortion detection, distortion analysis, tool selection, and tool execution. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents.
arXiv Detail & Related papers (2025-09-30T09:37:01Z) - Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs [60.0988889107102]
- Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs [60.0988889107102]
We explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality standard that provides principles for the transformation. We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement.
arXiv Detail & Related papers (2025-09-29T05:22:10Z) - Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots [13.26825865228582]
- Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots [13.26825865228582]
We propose eight uncertainty metrics and five quality metrics specifically designed for VLA models in robotic manipulation tasks. We assess their effectiveness through a large-scale empirical study involving 908 successful task executions from three state-of-the-art VLA models.
arXiv Detail & Related papers (2025-07-22T22:15:59Z) - Evaluating Multimodal Large Language Models on Educational Textbook Question Answering [3.4729524020941063]
- Evaluating Multimodal Large Language Models on Educational Textbook Question Answering [3.4729524020941063]
Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset.
arXiv Detail & Related papers (2025-06-18T19:31:35Z) - Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
- Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z) - Scaling-up Perceptual Video Quality Assessment [54.691252495691955]
- Scaling-up Perceptual Video Quality Assessment [54.691252495691955]
We show how to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
arXiv Detail & Related papers (2025-05-28T16:24:52Z) - Teaching LMMs for Image Quality Scoring and Interpreting [71.1335005098584]
- Teaching LMMs for Image Quality Scoring and Interpreting [71.1335005098584]
We propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables image quality scoring and interpreting simultaneously. Q-SiT is the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance in both tasks with superior generalization in IQA.
arXiv Detail & Related papers (2025-03-12T09:39:33Z) - M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z) - Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models [93.91086467402323]
- Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models [93.91086467402323]
The Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA) is designed to efficiently adapt the vision-language pre-trained model, CLIP, to IQA tasks. GRMP-IQA consists of two core modules: (i) a Meta-Prompt Pre-training Module and (ii) Quality-Aware Gradient Regularization.
arXiv Detail & Related papers (2024-09-09T07:26:21Z)