Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring
- URL: http://arxiv.org/abs/2510.01683v1
- Date: Thu, 02 Oct 2025 05:15:40 GMT
- Title: Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring
- Authors: Han-Jay Shu, Wei-Ning Chiu, Shun-Ting Chang, Meng-Ping Huang, Takeshi Tohyama, Ahram Han, Po-Chih Kuo,
- Abstract summary: We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone chest radiograph (CXR) cases. ASRS scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
- Score: 1.9837702647603577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
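The ASRS pipeline described in the abstract (apply plausible rotations, re-embed, measure embedding shift, stratify into quartiles) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `embed` callable stands in for the RAD-DINO encoder, and cosine distance is one plausible choice of embedding-shift measure.

```python
import numpy as np
from scipy.ndimage import rotate


def asrs_score(image, embed, angles=(-30, -15, 15, 30)):
    """Augmentation-sensitivity risk score: mean embedding shift under
    clinically plausible rotations. `embed` is a placeholder for the
    RAD-DINO encoder; cosine distance is an assumed shift measure."""
    e0 = embed(image)
    shifts = []
    for angle in angles:
        rotated = rotate(image, angle, reshape=False, mode="nearest")
        ea = embed(rotated)
        # Cosine distance between original and rotated embeddings.
        cos = np.dot(e0, ea) / (np.linalg.norm(e0) * np.linalg.norm(ea))
        shifts.append(1.0 - cos)
    return float(np.mean(shifts))


def stability_quartiles(scores):
    """Stratify samples into stability quartiles by sensitivity score
    (0 = most stable, 3 = most augmentation-sensitive)."""
    q = np.quantile(scores, [0.25, 0.5, 0.75])
    return np.searchsorted(q, scores)
```

Under this sketch, samples in the highest-sensitivity quartile would be the candidates flagged for selective prediction or clinician review.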
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z) - X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging [67.85884025186755]
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images. We propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection.
arXiv Detail & Related papers (2026-02-10T00:03:43Z) - Empirical Analysis of Adversarial Robustness and Explainability Drift in Cybersecurity Classifiers [0.0]
This paper presents an empirical study of adversarial robustness and explainability drift across two cybersecurity domains. We introduce a quantitative metric, the Robustness Index (RI), defined as the area under the accuracy perturbation curve. Experiments on the Phishing Websites and NB15 datasets show consistent robustness trends, with adversarial training improving RI by up to 9 percent while maintaining clean-data accuracy.
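The Robustness Index described in that abstract is an area under an accuracy-vs-perturbation curve. A minimal sketch, assuming accuracies are measured at increasing perturbation strengths and the area is normalized by the strength range so RI lies in [0, 1]:

```python
import numpy as np


def robustness_index(perturbation_strengths, accuracies):
    """Robustness Index: normalized area under the accuracy curve as
    perturbation strength grows. The normalization to [0, 1] is an
    assumption; the paper only states 'area under the curve'."""
    eps = np.asarray(perturbation_strengths, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    # Trapezoidal integration of accuracy over perturbation strength.
    area = np.sum((acc[1:] + acc[:-1]) / 2.0 * np.diff(eps))
    return float(area / (eps[-1] - eps[0]))
```

A model whose accuracy stays flat under perturbation scores RI = 1.0; one whose accuracy decays linearly to zero scores 0.5.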
arXiv Detail & Related papers (2026-02-06T05:30:37Z) - Decision-Aware Trust Signal Alignment for SOC Alert Triage [0.0]
This paper presents a decision-aware trust-signal alignment scheme for SOC alert triage. The framework combines calibrated confidence, lightweight uncertainty cues, and cost-sensitive decision thresholds into a coherent decision-support layer. We show that misaligned displays of confidence greatly amplify false negatives, whereas models with decision-aligned trust signals reduce cost-weighted loss by orders of magnitude.
arXiv Detail & Related papers (2026-01-08T01:41:54Z) - I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging [2.384534878752428]
We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982; on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269.
arXiv Detail & Related papers (2025-11-05T23:28:14Z) - Conformal Lesion Segmentation for 3D Medical Images [82.92159832699583]
We propose a risk-constrained framework that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
arXiv Detail & Related papers (2025-10-19T08:21:00Z) - AT-CXR: Uncertainty-Aware Agentic Triage for Chest X-rays [12.843444405498404]
We introduce AT-CXR, an uncertainty-aware agent for chest X-rays. The system estimates per-case confidence and distributional fit, then follows a stepwise policy to issue an automated decision. We evaluate two router designs that share the same inputs and actions.
arXiv Detail & Related papers (2025-08-26T14:33:09Z) - Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z) - GRASP-PsONet: Gradient-based Removal of Spurious Patterns for PsOriasis Severity Classification [0.0]
We propose a framework to automatically flag problematic training images that introduce spurious correlations. Removing 8.2% of flagged images improves model AUC-ROC by 5% (85% to 90%) on a held-out test set. When applied to a subset of training data rated by two dermatologists, the method identifies over 90% of cases with inter-rater disagreement.
arXiv Detail & Related papers (2025-06-27T03:42:09Z) - TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z) - Statistical Management of the False Discovery Rate in Medical Instance Segmentation Based on Conformal Risk Control [2.4578723416255754]
Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Deep learning models such as Mask R-CNN and BlendMask have achieved remarkable progress, but their application in high-risk medical scenarios remains constrained by confidence calibration issues. We propose a robust quality control framework based on conformal prediction theory to address this challenge.
arXiv Detail & Related papers (2025-04-06T13:31:19Z) - Inadequacy of common stochastic neural networks for reliable clinical decision support [0.4262974002462632]
Widespread adoption of AI for medical decision making is still hindered due to ethical and safety-related concerns.
Common deep learning approaches, however, tend toward overconfidence under data shift.
This study investigates their actual reliability in clinical applications.
arXiv Detail & Related papers (2024-01-24T18:49:30Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians. We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.