Related papers: Metrics that matter: Evaluating image quality metrics for medical image generation

Metrics that matter: Evaluating image quality metrics for medical image generation

URL: http://arxiv.org/abs/2505.07175v1
Date: Mon, 12 May 2025 01:57:25 GMT
Title: Metrics that matter: Evaluating image quality metrics for medical image generation
Authors: Yash Deo, Yan Jia, Toni Lassila, William A. P. Smith, Tom Lawton, Siyuan Kang, Alejandro F. Frangi, Ibrahim Habli,
Abstract summary: This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data.<n>We evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, morphological alterations designed to mimic clinically relevant inaccuracies.
Score: 48.85783422900129
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating generative models for synthetic medical imaging is crucial yet challenging, especially given the high standards of fidelity, anatomical accuracy, and safety required for clinical applications. Standard evaluation of generated images often relies on no-reference image quality metrics when ground truth images are unavailable, but their reliability in this complex domain is not well established. This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data, including tumour and vascular images, providing a representative exemplar for the field. We systematically evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, localised morphological alterations designed to mimic clinically relevant inaccuracies. We then compare these metric scores against model performance on a relevant downstream segmentation task, analysing results across both controlled image perturbations and outputs from different generative model architectures. Our findings reveal significant limitations: many widely-used no-reference image quality metrics correlate poorly with downstream task suitability and exhibit a profound insensitivity to localised anatomical details crucial for clinical validity. Furthermore, these metrics can yield misleading scores regarding distribution shifts, e.g. data memorisation. This reveals the risk of misjudging model readiness, potentially leading to the deployment of flawed tools that could compromise patient safety. We conclude that ensuring generative models are truly fit for clinical purpose requires a multifaceted validation framework, integrating performance on relevant downstream tasks with the cautious interpretation of carefully selected no-reference image quality metrics.

Related papers

Beyond Pixels: Medical Image Quality Assessment with Implicit Neural Representations [2.0934875997852096]
Artifacts pose a significant challenge in medical imaging, impacting diagnostic accuracy and downstream analysis.<n>We propose the use of implicit neural representations (INRs) for image quality assessment.<n>Our method is evaluated on the ACDC dataset with synthetically generated artifact patterns.
arXiv Detail & Related papers (2025-08-07T09:00:06Z)
IMPACT: A Generic Semantic Loss for Multimodal Medical Image Registration [0.46904601975060667]
IMPACT (Image Metric with Pretrained model-Agnostic Comparison for Transmodality registration) is a novel similarity metric designed for robust multimodal image registration.<n>It defines a semantic similarity measure based on the comparison of deep features extracted from large-scale pretrained segmentation models.<n>It was evaluated on five challenging 3D registration tasks involving thoracic CT/CBCT and pelvic MR/CT datasets.
arXiv Detail & Related papers (2025-03-31T14:08:21Z)
Trustworthy image-to-image translation: evaluating uncertainty calibration in unpaired training scenarios [0.0]
Mammographic screening is an effective method for detecting breast cancer, facilitating early diagnosis.<n>Deep neural networks have been shown effective in some studies, but their tendency to overfit leaves considerable risk for poor generalisation and misdiagnosis.<n>Data augmentation schemes based on unpaired neural style transfer models have been proposed that improve generalisability.<n>We evaluate their performance when trained on image patches parsed from three open access mammography datasets and one non-medical image dataset.
arXiv Detail & Related papers (2025-01-29T11:09:50Z)
Fréchet Radiomic Distance (FRD): A Versatile Metric for Comparing Medical Imaging Datasets [13.737058479403311]
We introduce a new perceptual metric tailored for medical images, FRD (Fr'echet Radiomic Distance)<n>We show that FRD is superior to other image distribution metrics for a range of medical imaging applications.<n> FRD offers additional benefits such as stability and computational efficiency at low sample sizes.
arXiv Detail & Related papers (2024-12-02T13:49:14Z)
A Unified Model for Compressed Sensing MRI Across Undersampling Patterns [69.19631302047569]
We propose a unified MRI reconstruction model robust to various measurement undersampling patterns and image resolutions.<n>Our model improves SSIM by 11% and PSNR by 4 dB over a state-of-the-art CNN (End-to-End VarNet) with 600$times$ faster inference than diffusion methods.
arXiv Detail & Related papers (2024-10-05T20:03:57Z)
Five Pitfalls When Assessing Synthetic Medical Images with Reference Metrics [0.9582978458237521]
Reference metrics have been developed to objectively and quantitatively compare two images. The correlation of reference metrics and human perception of quality can vary strongly for different kinds of distortions. We selected five pitfalls that showcase unexpected and probably undesired reference metric scores.
arXiv Detail & Related papers (2024-08-12T11:48:57Z)
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge [93.61262892578067]
Uncertainty in medical image segmentation tasks, especially inter-rater variability, presents a significant challenge. This variability directly impacts the development and evaluation of automated segmentation algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ)
arXiv Detail & Related papers (2024-03-19T17:57:24Z)
K-Space-Aware Cross-Modality Score for Synthesized Neuroimage Quality Assessment [71.27193056354741]
The problem of how to assess cross-modality medical image synthesis has been largely unexplored. We propose a new metric K-CROSS to spur progress on this challenging problem. K-CROSS uses a pre-trained multi-modality segmentation network to predict the lesion location.
arXiv Detail & Related papers (2023-07-10T01:26:48Z)
On Sensitivity and Robustness of Normalization Schemes to Input Distribution Shifts in Automatic MR Image Diagnosis [58.634791552376235]
Deep Learning (DL) models have achieved state-of-the-art performance in diagnosing multiple diseases using reconstructed images as input. DL models are sensitive to varying artifacts as it leads to changes in the input data distribution between the training and testing phases. We propose to use other normalization techniques, such as Group Normalization and Layer Normalization, to inject robustness into model performance against varying image artifacts.
arXiv Detail & Related papers (2023-06-23T03:09:03Z)
Explainable Image Quality Assessment for Medical Imaging [0.0]
Poor-quality medical images may lead to misdiagnosis. We propose an explainable image quality assessment system and validate our idea on two different objectives. We apply a variety of techniques to measure the faithfulness of the saliency detectors. We show that NormGrad has significant gains over other saliency detectors by reaching a repeated Pointing Game score of 0.853 for Object-CXR and 0.611 for LVOT datasets.
arXiv Detail & Related papers (2023-03-25T14:18:39Z)
Negligible effect of brain MRI data preprocessing for tumor segmentation [36.89606202543839]
We conduct experiments on three publicly available datasets and evaluate the effect of different preprocessing steps in deep neural networks. Our results demonstrate that most popular standardization steps add no value to the network performance. We suggest that image intensity normalization approaches do not contribute to model accuracy because of the reduction of signal variance with image standardization.
arXiv Detail & Related papers (2022-04-11T17:29:36Z)
Malignancy Prediction and Lesion Identification from Clinical Dermatological Images [65.1629311281062]
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images. We first identify all lesions present in the image regardless of sub-type or likelihood of malignancy, then it estimates their likelihood of malignancy, and through aggregation, it also generates an image-level likelihood of malignancy.
arXiv Detail & Related papers (2021-04-02T20:52:05Z)
Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification. It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations. Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.