MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
- URL: http://arxiv.org/abs/2510.01691v1
- Date: Thu, 02 Oct 2025 05:42:00 GMT
- Title: MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
- Authors: Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu
- Abstract summary: We introduce MedQ-Bench, a comprehensive benchmark for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
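The abstract describes a multi-dimensional judging protocol that scores model outputs along four complementary axes. As a rough illustration of how such a protocol can be aggregated, here is a minimal Python sketch; the axis names, the 0-2 score range, and the uniform weighting are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch of a multi-axis judging aggregation, loosely modeled on
# the kind of protocol MedQ-Bench describes. Axis names, score range (0-2),
# and equal weights are assumptions for illustration only.
from dataclasses import dataclass

AXES = ("completeness", "preciseness", "consistency", "quality_accuracy")


@dataclass
class JudgedResponse:
    """Per-axis scores for one model response, as an LLM judge might emit."""
    scores: dict  # axis name -> integer score in [0, 2]


def aggregate(judged: JudgedResponse, weights=None) -> float:
    """Weighted mean over the four axes, normalized to [0, 1]."""
    weights = weights or {axis: 1.0 for axis in AXES}
    total = sum(weights[a] * judged.scores[a] for a in AXES)
    return total / (2.0 * sum(weights[a] for a in AXES))


r = JudgedResponse(scores={"completeness": 2, "preciseness": 1,
                           "consistency": 2, "quality_accuracy": 1})
print(aggregate(r))  # 0.75
```

Normalizing to [0, 1] makes scores comparable across judging configurations with different axis weights.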
Related papers
- Image Quality Assessment for Machines: Paradigm, Large-scale Database, and Models [60.356842878501254]
Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. We propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance.
arXiv Detail & Related papers (2025-08-27T13:07:24Z) - MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration [76.94293572477379]
We propose a multi-dimensional image quality assessment (MDIQA) framework. We model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions. Once the MDIQA model is ready, we can deploy it for flexible training of image restoration (IR) models.
arXiv Detail & Related papers (2025-08-23T03:17:14Z) - MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment [26.185840831950063]
Existing medical IQA methods, however, struggle to generalize across diverse modalities and clinical scenarios. We introduce MedIQA, the first comprehensive foundation model for medical IQA, designed to handle variability in image dimensions, modalities, anatomical regions, and types.
arXiv Detail & Related papers (2025-07-25T07:02:47Z) - PhotIQA: A photoacoustic image data set with image quality ratings [7.753621023890248]
PhotIQA is a data set consisting of 1134 reconstructed photoacoustic (PA) images rated by 2 experts across five quality properties. Our baseline experiments show that HaarPSI$_{med}$ significantly outperforms SSIM in correlating with the quality ratings.
arXiv Detail & Related papers (2025-07-04T11:06:54Z) - MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning [24.9872402922819]
Existing medical VQA benchmarks mostly focus on single-image analysis. We introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA.
arXiv Detail & Related papers (2025-05-22T17:46:11Z) - Image Quality Assessment for Embodied AI [103.66095742463195]
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories. There is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots.
arXiv Detail & Related papers (2025-05-22T15:51:07Z) - AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation [55.2739790399209]
We present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation.
arXiv Detail & Related papers (2025-05-17T07:44:54Z) - AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images [58.87047247313503]
We introduce AGHI-QA, the first large-scale benchmark specifically designed for quality assessment of AI-generated human images (AGHIs). The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state-of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, and visible and distorted body-part labels.
arXiv Detail & Related papers (2025-04-30T04:36:56Z) - MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT [6.158876574189994]
Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques.
Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA.
We propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution.
arXiv Detail & Related papers (2023-11-14T09:33:33Z) - Blind Multimodal Quality Assessment: A Brief Survey and A Case Study of Low-light Images [73.27643795557778]
Blind image quality assessment (BIQA) aims at automatically and accurately forecasting objective scores for visual signals.
Recent developments in this field are dominated by unimodal solutions inconsistent with human subjective rating patterns.
We present a unique blind multimodal quality assessment (BMQA) of low-light images from subjective evaluation to objective score.
arXiv Detail & Related papers (2023-03-18T09:04:55Z) - Image Quality Assessment for Magnetic Resonance Imaging [4.05136808278614]
Image quality assessment (IQA) algorithms aim to reproduce human perception of image quality.
We use outputs of neural network models trained to solve problems relevant to MRI.
Seven trained radiologists assess distorted images, with their verdicts then correlated with 35 different image quality metrics.
arXiv Detail & Related papers (2022-03-15T11:52:29Z)