The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability
- URL: http://arxiv.org/abs/2509.24958v1
- Date: Mon, 29 Sep 2025 15:52:36 GMT
- Title: The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability
- Authors: Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu
- Abstract summary: We present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience.
- Score: 15.649293541650811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
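As a rough illustration of the evaluation loop the abstract describes, the sketch below runs a doctor model against a simulated patient and collects per-facet scores. The `doctor_llm` and `patient_agent` callables, the turn logic, and the scoring are placeholder assumptions, not MAQuE's released interfaces; only the five facet names come from the abstract.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for a doctor model and a simulated patient agent;
# MAQuE's actual interfaces are not reproduced here.
def doctor_llm(history: list[str]) -> str:
    return "Final diagnosis: ..." if len(history) >= 8 else "Can you describe the pain?"

def patient_agent(question: str) -> str:
    return "It started two days ago, mostly after meals."  # passive-disclosure style

@dataclass
class FacetScores:
    task_success: float = 0.0        # did the final diagnosis match the reference?
    inquiry_proficiency: float = 0.0
    dialogue_competence: float = 0.0
    inquiry_efficiency: float = 0.0  # e.g., penalize unnecessary turns
    patient_experience: float = 0.0

def run_episode(max_turns: int = 10) -> tuple[list[str], FacetScores]:
    history = ["Patient: I haven't been feeling well lately."]
    for turn in range(max_turns):
        question = doctor_llm(history)
        history.append(f"Doctor: {question}")
        if question.startswith("Final diagnosis"):
            break
        history.append(f"Patient: {patient_agent(question)}")
    # Real scoring would use reference answers and judge models.
    scores = FacetScores(inquiry_efficiency=1.0 - turn / max_turns)
    return history, scores

if __name__ == "__main__":
    transcript, scores = run_episode()
    print("\n".join(transcript))
    print(scores)
```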
Related papers
- EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning [72.70291772077738]
We propose Med-Inquire, a new benchmark designed to evaluate an agent's ability to perform multi-turn diagnosis. We then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Our experiments show EvoClinician outperforms continual learning baselines and other self-evolving agents, such as memory agents.
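One plausible reading of "test-time evolutionary learning" is a population of strategy prompts that is scored and mutated as cases arrive; the sketch below shows that pattern. The `evaluate` and `mutate` functions are placeholders (the paper's actual algorithm may differ substantially).

```python
import random

# Illustrative only: a "strategy" is modeled as a system-prompt string;
# fitness would be diagnostic accuracy on cases seen so far at test time.
def evaluate(strategy: str, case: dict) -> float:
    """Placeholder fitness: in practice, run the doctor agent with this
    strategy prompt on the case and check the resulting diagnosis."""
    return random.random()

def mutate(strategy: str) -> str:
    """Placeholder mutation: in practice, an LLM would rewrite the strategy."""
    return strategy + " | ask about symptom onset earlier"

def evolve_at_test_time(cases, population, generations=3, keep=2):
    for case in cases:
        for _ in range(generations):
            scored = sorted(population, key=lambda s: evaluate(s, case), reverse=True)
            survivors = scored[:keep]
            population = survivors + [mutate(s) for s in survivors]
    return population[0]  # best strategy found so far

best = evolve_at_test_time(
    cases=[{"complaint": "chest pain"}, {"complaint": "fatigue"}],
    population=["Take a focused history.", "Rule out emergencies first."],
)
print(best)
```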
arXiv Detail & Related papers (2026-01-30T13:26:18Z) - The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency [38.68458713626548]
Current benchmarks fail to capture the integrated, multimodal reasoning essential for real-world patient care. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway. Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning.
arXiv Detail & Related papers (2025-12-25T03:33:22Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1,762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation [23.85199484712499]
We propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice.
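A minimal sketch of the dynamic-generation idea: prompt a model for a fresh consultation-style case that bundles the true diagnosis with distractors. The prompt template and the `llm()` stub are assumptions for illustration, not DyReMe's code.

```python
import json

def llm(prompt: str) -> str:
    # Stand-in for any chat-completion call; returns a canned case here.
    return json.dumps({
        "presentation": "45-year-old with epigastric pain after meals",
        "true_diagnosis": "peptic ulcer disease",
        "distractors": ["biliary colic", "GERD", "pancreatitis"],
    })

CASE_PROMPT = (
    "Generate a fresh consultation-style case for diagnosis '{dx}'. "
    "Include plausible differential diagnoses and one common misdiagnosis "
    "factor as distractors. Return JSON with keys 'presentation', "
    "'true_diagnosis', 'distractors'."
)

def generate_case(diagnosis: str) -> dict:
    return json.loads(llm(CASE_PROMPT.format(dx=diagnosis)))

case = generate_case("peptic ulcer disease")
print(case["presentation"], "| distractors:", case["distractors"])
```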
arXiv Detail & Related papers (2025-10-10T11:19:04Z) - MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis. MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z) - AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation [55.2739790399209]
We present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters, specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation.
arXiv Detail & Related papers (2025-05-17T07:44:54Z) - Advancing Conversational Diagnostic AI with Multimodal Reasoning [44.1996223689966]
Articulate Medical Intelligence Explorer (AMIE) implements a state-aware dialogue framework in which conversation flow is dynamically controlled by intermediate model outputs. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors.
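The "state-aware" idea can be sketched as an intermediate model output (here, a predicted consultation phase) that selects the prompt governing the next turn. The phases and classifier below are illustrative assumptions, not AMIE's published design.

```python
from enum import Enum

class Phase(Enum):
    HISTORY_TAKING = "history"
    DIFFERENTIAL = "differential"
    MANAGEMENT = "management"

def classify_phase(history: list[str]) -> Phase:
    """Stand-in for an intermediate model output that predicts the
    current consultation phase from the transcript."""
    if len(history) < 6:
        return Phase.HISTORY_TAKING
    if len(history) < 10:
        return Phase.DIFFERENTIAL
    return Phase.MANAGEMENT

PHASE_PROMPTS = {
    Phase.HISTORY_TAKING: "Ask one focused history question.",
    Phase.DIFFERENTIAL: "Discuss the leading differential diagnoses.",
    Phase.MANAGEMENT: "Propose next steps and safety-netting advice.",
}

def next_doctor_turn(history: list[str]) -> str:
    phase = classify_phase(history)
    # In a real system this prompt would condition an LLM call.
    return f"[{phase.value}] {PHASE_PROMPTS[phase]}"

print(next_doctor_turn(["Patient: I have a cough."]))
```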
arXiv Detail & Related papers (2025-05-06T20:52:01Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
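A compact sketch of a three-stage, patient-journey evaluation in this spirit: score examination recommendations, the diagnosis, and the treatment plan against per-case references. The reference format and scoring rules are placeholders, not MedR-Bench's metrics.

```python
from dataclasses import dataclass

@dataclass
class CaseReference:
    recommended_exams: set[str]
    diagnosis: str
    treatment_keywords: set[str]

def score_exams(predicted: set[str], ref: CaseReference) -> float:
    # Recall of reference-recommended examinations.
    return len(predicted & ref.recommended_exams) / max(len(ref.recommended_exams), 1)

def score_diagnosis(predicted: str, ref: CaseReference) -> float:
    return 1.0 if predicted.lower() == ref.diagnosis.lower() else 0.0

def score_treatment(predicted: str, ref: CaseReference) -> float:
    hits = sum(k in predicted.lower() for k in ref.treatment_keywords)
    return hits / max(len(ref.treatment_keywords), 1)

ref = CaseReference({"cbc", "crp"}, "appendicitis", {"appendectomy", "antibiotics"})
print(score_exams({"cbc", "ultrasound"}, ref),
      score_diagnosis("Appendicitis", ref),
      score_treatment("urgent appendectomy with antibiotics", ref))
```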
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
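A common way to operationalize structured medical reasoning is to require a fixed output schema and validate it before accepting an answer; the sketch below shows that pattern. The key names and the `llm()` stub are assumptions; the paper's exact format may differ.

```python
import json

REQUIRED_KEYS = {"findings", "reasoning_steps", "diagnosis"}

def llm(prompt: str) -> str:
    # Stand-in for any chat-completion call; returns a canned answer here.
    return json.dumps({
        "findings": ["fever", "right lower quadrant pain"],
        "reasoning_steps": ["localized pain plus fever suggests appendicitis"],
        "diagnosis": "acute appendicitis",
    })

def structured_answer(question: str) -> dict:
    raw = llm(
        f"{question}\nAnswer ONLY with JSON containing keys: "
        f"{sorted(REQUIRED_KEYS)}."
    )
    parsed = json.loads(raw)
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        raise ValueError(f"model omitted keys: {missing}")
    return parsed

print(structured_answer("A 20-year-old presents with fever and RLQ pain.")["diagnosis"])
```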
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - WiseMind: Recontextualizing AI with a Knowledge-Guided, Theory-Informed Multi-Agent Framework for Instrumental and Humanistic Benefits [10.8749978349074]
WiseMind is an interdisciplinary contextualization framework for NLP. Tested on depression, anxiety, and bipolar disorder, WiseMind attains up to 84.2% diagnostic accuracy. Results show that deep contextualization, across knowledge, process, and evaluation layers, can transform benchmark-driven NLP into clinically meaningful impact.
arXiv Detail & Related papers (2025-02-28T03:45:39Z) - Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [25.068780967617485]
Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. We propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement.
arXiv Detail & Related papers (2025-02-11T00:13:52Z) - AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments [2.567146936147657]
We introduce AgentClinic, a multimodal agent benchmark for evaluating large language models (LLMs) in simulated clinical environments. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy.
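The gap between static QA and sequential decision-making can be made concrete: instead of receiving the whole vignette, the doctor starts from the chief complaint and must query for each remaining fact. The toy case, patient, and policy below are illustrative stand-ins, not AgentClinic's data or agents.

```python
CASE_FACTS = {
    "chief complaint": "shortness of breath",
    "onset": "sudden, one hour ago",
    "history": "recent long-haul flight",
    "vitals": "HR 115, SpO2 89%",
}

def patient(query: str) -> str:
    # The patient reveals a fact only when asked about it (passive disclosure).
    return CASE_FACTS.get(query, "I'm not sure what you mean.")

def doctor_policy(known: dict) -> str:
    # Toy policy: ask for each missing fact in a fixed order, then diagnose.
    for key in ("onset", "history", "vitals"):
        if key not in known:
            return key
    return "DIAGNOSE: pulmonary embolism"

known = {"chief complaint": CASE_FACTS["chief complaint"]}
while True:
    action = doctor_policy(known)
    if action.startswith("DIAGNOSE"):
        print(action)
        break
    known[action] = patient(action)
```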
arXiv Detail & Related papers (2024-05-13T17:38:53Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor (the player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Towards Conversational Diagnostic AI [32.84876349808714]
We introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions.
AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors.
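The self-play setup described above can be sketched as a loop in which the model plays doctor against a simulated patient and an automated critic's feedback conditions the next round. All three roles below are placeholder functions, not AMIE's training pipeline.

```python
def doctor(history, guidance):
    # In the real system, an LLM call conditioned on the transcript and feedback.
    return f"(guided by: {guidance}) Tell me more about your symptoms."

def simulated_patient(utterance):
    return "The headache is worse in the mornings."

def critic(transcript):
    """Stand-in for automated feedback: returns a score and a critique
    that is fed back into the next self-play round."""
    return 0.6, "Ask about red-flag symptoms earlier."

guidance = "none"
for round_ in range(3):
    transcript = []
    for _ in range(2):  # short dialogue per round
        q = doctor(transcript, guidance)
        transcript += [q, simulated_patient(q)]
    score, guidance = critic(transcript)
    print(f"round {round_}: score={score:.1f}, next guidance={guidance!r}")
```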
arXiv Detail & Related papers (2024-01-11T04:25:06Z) - Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting [82.64015366154884]
We study the task of cognitive distortion detection and propose Diagnosis of Thought (DoT) prompting.
DoT performs diagnosis on the patient's speech via three stages: subjectivity assessment to separate the facts and the thoughts; contrastive reasoning to elicit the reasoning processes supporting and contradicting the thoughts; and schema analysis to summarize the cognition schemas.
Experiments demonstrate that DoT obtains significant improvements over ChatGPT for cognitive distortion detection, while generating high-quality rationales approved by human experts.
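Since the abstract names the three DoT stages explicitly, they can be sketched as chained prompts, with each stage consuming the previous stage's output. The prompt wording and the `llm()` stub are assumptions; only the stage structure comes from the abstract.

```python
def llm(prompt: str) -> str:
    # Stand-in for any chat-completion call.
    return f"[model output for: {prompt[:40]}...]"

def diagnosis_of_thought(patient_speech: str) -> dict:
    # Stage 1: subjectivity assessment - separate facts from thoughts.
    subjectivity = llm(
        "Separate the facts from the thoughts/interpretations in this "
        f"patient statement:\n{patient_speech}"
    )
    # Stage 2: contrastive reasoning - support vs. contradiction.
    contrastive = llm(
        "List reasoning that supports and reasoning that contradicts the "
        f"thoughts below:\n{subjectivity}"
    )
    # Stage 3: schema analysis - summarize the cognition schema.
    schema = llm(
        "Summarize the underlying cognition schema and name any cognitive "
        f"distortion:\n{contrastive}"
    )
    return {"subjectivity": subjectivity, "contrastive": contrastive, "schema": schema}

result = diagnosis_of_thought("I failed one exam, so I'm a complete failure.")
print(result["schema"])
```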
arXiv Detail & Related papers (2023-10-11T02:47:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences arising from its use.