"Crash Test Dummies" for AI-Enabled Clinical Assessment: Validating Virtual Patient Scenarios with Virtual Learners
- URL: http://arxiv.org/abs/2601.18085v1
- Date: Mon, 26 Jan 2026 02:47:28 GMT
- Title: "Crash Test Dummies" for AI-Enabled Clinical Assessment: Validating Virtual Patient Scenarios with Virtual Learners
- Authors: Brian Gin, Ahreum Lim, Flávia Silva e Oliveira, Kuan Xing, Xiaomei Song, Gayana Amiyangoda, Thilanka Seneviratne, Alison F. Doubleday, Ananya Gangopadhyaya, Bob Kiser, Lukas Shum-Tim, Dhruva Patel, Kosala Marambe, Lauren Maggio, Ara Tekian, Yoon Soo Park,
- Abstract summary: In medical and health professions education, AI is increasingly used to assess clinical competencies, including via virtual standardized patients.<n>Most evaluations rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores.<n>We develop an open-source platform and measurement model for robust competency evaluation across cases and rating conditions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Background: In medical and health professions education (HPE), AI is increasingly used to assess clinical competencies, including via virtual standardized patients. However, most evaluations rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores. This leaves robustness uncertain and can expose learners to misguidance from unvalidated systems. We address this by using AI "simulated learners" to stress-test and psychometrically characterize assessment pipelines before human use. Objective: Develop an open-source AI virtual patient platform and measurement model for robust competency evaluation across cases and rating conditions. Methods: We built a platform with virtual patients, virtual learners with tunable ACGME-aligned competency profiles, and multiple independent AI raters scoring encounters with structured Key-Features items. Transcripts were analyzed with a Bayesian HRM-SDT model that treats ratings as decisions under uncertainty and separates learner ability, case performance, and rater behavior; parameters were estimated with MCMC. Results: The model recovered simulated learners' competencies, with significant correlations to the generating competencies across all ACGME domains despite a non-deterministic pipeline. It estimated case difficulty by competency and showed stable rater detection (sensitivity) and criteria (severity/leniency thresholds) across AI raters using identical models/prompts but different seeds. We also propose a staged "safety blueprint" for deploying AI tools with learners, tied to entrustment-based validation milestones. Conclusions: Combining a purpose-built virtual patient platform with a principled psychometric model enables robust, interpretable, generalizable competency estimates and supports validation of AI-assisted assessment prior to use with human learners.
Related papers
- A Voice-Enabled Virtual Patient System for Interactive Training in Standardized Clinical Assessment [0.0]
We introduce a voice-enabled virtual patient simulation system powered by a large language model (LLM)<n>This study describes the system's development and validates its ability to generate virtual patients who adhere to pre-defined clinical profiles.<n>Our findings suggest that LLM-powered virtual patient simulations are a viable and scalable tool for training clinicians.
arXiv Detail & Related papers (2025-11-01T21:18:08Z) - How to Evaluate Medical AI [4.23552814358972]
We introduce Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD)<n>RPAD and RRAD compare AI outputs against multiple expert opinions rather than a single reference.<n>Large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus.
arXiv Detail & Related papers (2025-09-15T14:01:22Z) - Automated Real-time Assessment of Intracranial Hemorrhage Detection AI Using an Ensembled Monitoring Model (EMM) [1.8767322781894276]
We introduce Ensembled Monitoring Model (EMM), a framework inspired by clinical consensus practices using multiple expert reviews.<n>EMM operates independently without requiring access to internal AI components or intermediate outputs.<n>We demonstrate that EMM successfully categorizes confidence in the AI-generated prediction, suggesting different actions.
arXiv Detail & Related papers (2025-05-16T22:50:42Z) - Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics [8.692647930497936]
We use conformal analysis to quantify the predictive uncertainty of a vision transformer based foundation model.<n>We show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model.
arXiv Detail & Related papers (2025-03-31T08:06:00Z) - Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering [51.26412822853409]
We present a novel personalized federated learning (pFL) method for medical visual question answering (VQA) models.
Our method introduces learnable prompts into a Transformer architecture to efficiently train it on diverse medical datasets without massive computational costs.
arXiv Detail & Related papers (2024-10-23T00:31:17Z) - Simulated patient systems are intelligent when powered by large language model-based AI agents [32.73072809937573]
We developed AIPatient, an intelligent simulated patient system powered by large language model-based AI agents.<n>The system incorporates the Retrieval Augmented Generation framework, powered by six task-specific LLM-based AI agents for complex reasoning.<n>For simulation reality, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data.
arXiv Detail & Related papers (2024-09-27T17:17:15Z) - TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic
Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale public-available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised
Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z) - Hemogram Data as a Tool for Decision-making in COVID-19 Management:
Applications to Resource Scarcity Scenarios [62.997667081978825]
COVID-19 pandemics has challenged emergency response systems worldwide, with widespread reports of essential services breakdown and collapse of health care structure.
This work describes a machine learning model derived from hemogram exam data performed in symptomatic patients.
Proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.