How to Evaluate Medical AI
- URL: http://arxiv.org/abs/2509.11941v2
- Date: Thu, 25 Sep 2025 09:31:04 GMT
- Title: How to Evaluate Medical AI
- Authors: Ilia Kopanichuk, Petr Anokhin, Vladimir Shaposhnikov, Vladimir Makharev, Ekaterina Tsapieva, Iaroslav Bespalov, Dmitry V. Dylov, Ivan Oseledets,
- Abstract summary: We introduce Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD)<n>RPAD and RRAD compare AI outputs against multiple expert opinions rather than a single reference.<n>Large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus.
- Score: 4.23552814358972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of artificial intelligence (AI) into medical diagnostic workflows requires robust and consistent evaluation methods to ensure reliability, clinical relevance, and the inherent variability in expert judgments. Traditional metrics like precision and recall often fail to account for the inherent variability in expert judgments, leading to inconsistent assessments of AI performance. Inter-rater agreement statistics like Cohen's Kappa are more reliable but they lack interpretability. We introduce Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD) - a new evaluation metrics that compare AI outputs against multiple expert opinions rather than a single reference. By normalizing performance against inter-expert disagreement, these metrics provide a more stable and realistic measure of the quality of predicted diagnosis. In addition to the comprehensive analysis of diagnostic quality measures, our study contains a very important side result. Our evaluation methodology allows us to avoid selecting diagnoses from a limited list when evaluating a given case. Instead, both the models being tested and the examiners verifying them arrive at a free-form diagnosis. In this automated methodology for establishing the identity of free-form clinical diagnoses, a remarkable 98% accuracy becomes attainable. We evaluate our approach using 360 medical dialogues, comparing multiple large language models (LLMs) against a panel of physicians. Large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus. Moreover, we demonstrate that expert judgments exhibit significant variability - often greater than that between AI and humans. This finding underscores the limitations of any absolute metrics and supports the need to adopt relative metrics in medical AI.
Related papers
- Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations.<n>Our benchmark unifies three types of medical data for open-ended diagnostic generation.<n>We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z) - A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist [1.1731001328350983]
This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset.<n>We analyze both cognitive adaptation and calibration error using metrics: Expected Error (ECE) and a baseline-normalized Relative Error (RCE)<n>Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions.
arXiv Detail & Related papers (2025-10-22T00:15:02Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks.<n>RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - Sequential Diagnosis with Language Models [21.22416732642907]
We introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging cases into stepwise diagnostic encounters.<n>Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed.<n>We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians.
arXiv Detail & Related papers (2025-06-27T17:27:26Z) - Automated Real-time Assessment of Intracranial Hemorrhage Detection AI Using an Ensembled Monitoring Model (EMM) [1.8767322781894276]
We introduce Ensembled Monitoring Model (EMM), a framework inspired by clinical consensus practices using multiple expert reviews.<n>EMM operates independently without requiring access to internal AI components or intermediate outputs.<n>We demonstrate that EMM successfully categorizes confidence in the AI-generated prediction, suggesting different actions.
arXiv Detail & Related papers (2025-05-16T22:50:42Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Uncertainty-aware abstention in medical diagnosis based on medical texts [87.88110503208016]
This study addresses the critical issue of reliability for AI-assisted medical diagnosis.<n>We focus on the selection prediction approach that allows the diagnosis system to abstain from providing the decision if it is not confident in the diagnosis.<n>We introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks.
arXiv Detail & Related papers (2025-02-25T10:15:21Z) - Unified Uncertainty Estimation for Cognitive Diagnosis Models [70.46998436898205]
We propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models.
We decompose the uncertainty of diagnostic parameters into data aspect and model aspect.
Our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
arXiv Detail & Related papers (2024-03-09T13:48:20Z) - Evaluating AI systems under uncertain ground truth: a case study in dermatology [43.8328264420381]
We show that ignoring uncertainty leads to overly optimistic estimates of model performance.<n>In skin condition classification, we find that a large portion of the dataset exhibits significant ground truth uncertainty.
arXiv Detail & Related papers (2023-07-05T10:33:45Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Explaining medical AI performance disparities across sites with
confounder Shapley value analysis [8.785345834486057]
Multi-site evaluations are key to diagnosing such disparities.
Our framework provides a method for quantifying the marginal and cumulative effect of each type of bias on the overall performance difference.
We demonstrate its usefulness in a case study of a deep learning model trained to detect the presence of pneumothorax.
arXiv Detail & Related papers (2021-11-12T18:54:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.