Related papers: LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

URL: http://arxiv.org/abs/2602.09379v2
Date: Wed, 11 Feb 2026 02:31:46 GMT
Title: LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
Authors: Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng,
Abstract summary: Mental disorders are highly prevalent worldwide.<n>The shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment.<n>We present LingxiDiagBench, a large-scale multi-agent benchmark.
Score: 14.82377002030236
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

Related papers

MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models [28.184599359142307]
MentalBench is a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs)<n>At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders.
arXiv Detail & Related papers (2026-02-13T12:21:33Z)
MentalSeek-Dx: Towards Progressive Hypothetico-Deductive Reasoning for Real-world Psychiatric Diagnosis [27.839664095206857]
MentalSeek-Dx Bench is the first benchmark dedicated to disorder-level psychiatric diagnosis within real-world clinical settings.<n>It comprises 712 de-identified electronic health records annotated by board-certified psychiatrists under ICD-11 guidelines.<n>MentalSeek-Dx achieves state-of-the-art (SOTA) performance with only 14B parameters, establishing a clinically grounded framework for reliable psychiatric diagnosis.
arXiv Detail & Related papers (2026-02-03T10:03:35Z)
ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues.<n>Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent.<n>Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z)
Evolving Diagnostic Agents in a Virtual Clinical Environment [75.59389103511559]
We present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning.<n>Our method acquires diagnostic strategies through interactive exploration and outcome-based feedback.<n>DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o.
arXiv Detail & Related papers (2025-10-28T17:19:47Z)
Timely Clinical Diagnosis through Active Test Selection [49.091903570068155]
We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design) to better emulate real-world diagnostic reasoning.<n>LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data.<n>We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
arXiv Detail & Related papers (2025-10-21T18:10:45Z)
Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z)
Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry [1.2879523047871226]
PsychiatryBench is a rigorously curated benchmark grounded exclusively in expert-validated psychiatric textbooks and casebooks.<n> PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items.
arXiv Detail & Related papers (2025-09-07T20:57:24Z)
MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis.<n>MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z)
Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling [50.83055329849865]
PsyLLM is a large language model designed to integrate diagnostic and therapeutic reasoning for mental health counseling.<n>It processes real-world mental health posts from Reddit and generates multi-turn dialogue structures.<n>Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2025-05-21T16:24:49Z)
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models [25.13622249539088]
DiagnosisArena is a benchmark designed to rigorously assess professional-level diagnostic competence.<n> DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties.<n>Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively.
arXiv Detail & Related papers (2025-05-20T09:14:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.