Related papers: Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

URL: http://arxiv.org/abs/2602.11391v1
Date: Wed, 11 Feb 2026 21:53:18 GMT
Title: Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Authors: Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger,
Abstract summary: Patient simulator generates realistic, controllable patient interactions that vary across medical, linguistic, and behavioral dimensions.<n>Simulator allows annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations.
Score: 1.90974530523188
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, \k{appa}=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, \k{appa}=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance across the health literacy spectrum: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient.

Related papers

DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services [49.70819009392778]
Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers.<n>This study aimed to develop and evaluate a taxonomy-grounded, multi-agent system for simulating realistic scenarios.
arXiv Detail & Related papers (2025-10-24T08:01:21Z)
Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs [3.364244912862208]
We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient subjective judgments.<n>The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million.
arXiv Detail & Related papers (2025-10-05T02:16:35Z)
Organ-Agents: Virtual Human Physiology Simulator via LLMs [66.40796430669158]
Organ-Agents is a multi-agent framework that simulates human physiology via LLM-driven agents.<n>We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables.<n>Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs 0.16 and robustness across SOFA-based severity strata.
arXiv Detail & Related papers (2025-08-20T01:58:45Z)
AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data [3.4206930658402115]
We present a Patient Simulator that leverages real world patient encounters.<n>The simulator provides a realistic approach to patient presentation and multi-turn conversation with a symptom-checking agent.
arXiv Detail & Related papers (2025-06-04T14:56:08Z)
PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions [21.02633535404176]
We introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios.<n>PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level.<n>The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework.
arXiv Detail & Related papers (2025-05-23T12:34:48Z)
Simulated patient systems are intelligent when powered by large language model-based AI agents [32.73072809937573]
We developed AIPatient, an intelligent simulated patient system powered by large language model-based AI agents.<n>The system incorporates the Retrieval Augmented Generation framework, powered by six task-specific LLM-based AI agents for complex reasoning.<n>For simulation reality, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data.
arXiv Detail & Related papers (2024-09-27T17:17:15Z)
Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data [55.84746218227712]
This study aims at assessing the relevance of a signalprocessingalgorithm, initially developed in the field of language acquisition, for the automatic measurement of speech fluency.
arXiv Detail & Related papers (2023-08-09T07:51:40Z)
Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence [79.038671794961]
We launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be distributedly trained and independently executed at each host institution. Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK.
arXiv Detail & Related papers (2021-11-18T00:43:41Z)
Semi-Supervised Variational Reasoning for Medical Dialogue Generation [70.838542865384]
Two key characteristics are relevant for medical dialogue generation: patient states and physician actions. We propose an end-to-end variational reasoning approach to medical dialogue generation. A physician policy network composed of an action-classifier and two reasoning detectors is proposed for augmented reasoning ability.
arXiv Detail & Related papers (2021-05-13T04:14:35Z)
Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks. Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets. We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.