Related papers: An autonomous agent for auditing and improving the reliability of clinical AI models

An autonomous agent for auditing and improving the reliability of clinical AI models

URL: http://arxiv.org/abs/2507.05755v1
Date: Tue, 08 Jul 2025 07:58:52 GMT
Title: An autonomous agent for auditing and improving the reliability of clinical AI models
Authors: Lukas Kuhn, Florian Buettner
Abstract summary: We introduce ModelAuditor, a self-reflective agent that converses with users. ModelAuditor simulates context-dependent, clinically relevant distribution shifts. It then generates interpretable reports explaining how much performance likely degrades during deployment.
Score: 11.225863068085266
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The deployment of AI models in clinical practice faces a critical challenge: models achieving expert-level performance on benchmarks can fail catastrophically when confronted with real-world variations in medical imaging. Minor shifts in scanner hardware, lighting or demographics can erode accuracy, but reliability auditing to identify such catastrophic failure cases before deployment is currently a bespoke and time-consuming process. Practitioners lack accessible and interpretable tools to expose and repair hidden failure modes. Here we introduce ModelAuditor, a self-reflective agent that converses with users, selects task-specific metrics, and simulates context-dependent, clinically relevant distribution shifts. ModelAuditor then generates interpretable reports explaining how much performance likely degrades during deployment, discussing specific likely failure modes and identifying root causes and mitigation strategies. Our comprehensive evaluation across three real-world clinical scenarios - inter-institutional variation in histopathology, demographic shifts in dermatology, and equipment heterogeneity in chest radiography - demonstrates that ModelAuditor is able to correctly identify context-specific failure modes of state-of-the-art models such as the established SIIM-ISIC melanoma classifier. Its targeted recommendations recover 15-25% of performance lost under real-world distribution shift, substantially outperforming both baseline models and state-of-the-art augmentation methods. These improvements are achieved through a multi-agent architecture and execute on consumer hardware in under 10 minutes, costing less than US$0.50 per audit.
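One plausible reading of "recover 15-25% of performance lost" is a ratio: the gain from a mitigation divided by the drop caused by the shift. The abstract does not define the metric, so the sketch below is an assumption, and the function name `fraction_recovered` and all numbers in it are hypothetical illustrations rather than values from the paper.

```python
def fraction_recovered(clean: float, shifted: float, mitigated: float) -> float:
    """Share of the shift-induced performance loss that a mitigation wins back.

    clean:     metric on in-distribution data (e.g. accuracy)
    shifted:   metric of the same model under distribution shift
    mitigated: metric under the same shift after applying a mitigation
    """
    lost = clean - shifted
    if lost <= 0:
        raise ValueError("no performance was lost under shift")
    return (mitigated - shifted) / lost


# Hypothetical numbers, not taken from the paper.
example = fraction_recovered(clean=0.90, shifted=0.70, mitigated=0.74)
print(f"{example:.0%} of the lost performance recovered")
```

Under this reading, a value of 1.0 would mean the mitigation fully restores in-distribution performance, and a value of 0.0 would mean it has no effect.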

Related papers

Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
Iterative Misclassification Error Training (IMET): An Optimized Neural Network Training Technique for Image Classification [0.5115559623386964]
We introduce Iterative Misclassification Error Training (IMET), a novel framework inspired by curriculum learning and coreset selection. IMET aims to identify misclassified samples in order to streamline the training process, while prioritizing the model's attention to edge case scenarios and rare outcomes. The paper evaluates IMET's performance on benchmark medical image classification datasets against state-of-the-art ResNet architectures.
arXiv Detail & Related papers (2025-07-01T04:14:16Z)
Keeping Medical AI Healthy: A Review of Detection and Correction Methods for System Degradation [6.781778751487079]
This review presents a forward-looking perspective on monitoring and maintaining the "health" of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.
arXiv Detail & Related papers (2025-06-20T19:22:07Z)
Examining Deployment and Refinement of the VIOLA-AI Intracranial Hemorrhage Model Using an Interactive NeoMedSys Platform [0.6582858408923039]
The current study describes a radiology software platform called NeoMedSys that can enable efficient deployment and refinements of AI models. We evaluated the feasibility and effectiveness of running NeoMedSys for three months in real-world clinical settings.
arXiv Detail & Related papers (2025-05-14T13:33:38Z)
Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
Generalizable automated ischaemic stroke lesion segmentation with vision transformers [0.7400397057238803]
Diffusion-weighted imaging (DWI) provides the highest expressivity in ischemic stroke. Current U-Net-based models therefore underperform, a problem accentuated by inadequate evaluation metrics. Here, we present a high-performance DWI lesion segmentation tool addressing these challenges.
arXiv Detail & Related papers (2025-02-10T19:00:00Z)
Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
Unmasking Dementia Detection by Masking Input Gradients: A JSM Approach to Model Interpretability and Precision [1.5501208213584152]
We introduce an interpretable, multimodal model for Alzheimer's disease (AD) classification over its multi-stage progression, incorporating Jacobian Saliency Map (JSM) as a modality-agnostic tool. Our evaluation, including an ablation study, demonstrates the efficacy of JSM for model debugging and interpretation, while also significantly enhancing model accuracy.
arXiv Detail & Related papers (2024-02-25T06:53:35Z)
TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials. We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
DirectDebug: Automated Testing and Debugging of Feature Models [55.41644538483948]
Variability models (e.g., feature models) are a common way for the representation of variabilities and commonalities of software artifacts. Complex and often large-scale feature models can become faulty, i.e., do not represent the expected variability properties of the underlying software artifact.
arXiv Detail & Related papers (2021-02-11T11:22:20Z)
Adversarial Sample Enhanced Domain Adaptation: A Case Study on Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation. Adversarially generated samples are used during domain adaptation. Results confirm the effectiveness of our method and its generality across different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z)
Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios. Our results show that using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.