ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
- URL: http://arxiv.org/abs/2603.02097v2
- Date: Tue, 03 Mar 2026 04:45:44 GMT
- Title: ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
- Authors: Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao
- Abstract summary: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity.
- Score: 39.33170904610862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
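The abstract introduces the Clinically Applicable Consistency Score (CACS@k) without stating its formula, so the sketch below is only one plausible reading: a rubric item earns credit only when all k independent judge runs agree the response satisfies it. The function name `cacs_at_k`, the boolean per-item verdicts, and the all-runs-agree rule are illustrative assumptions, not the paper's definition.

```python
"""Minimal sketch of a consistency-at-k score over rubric-based grading.
Assumption: CACS@k is read here as the fraction of rubric items that
pass in every one of k independent judge runs. The real metric in the
ClinConsensus paper may differ."""

from typing import List


def cacs_at_k(judge_runs: List[List[bool]]) -> float:
    """judge_runs: k runs, each a list of per-rubric-item pass/fail verdicts.

    Returns the fraction of rubric items judged 'pass' consistently
    across all k runs (a hypothetical reading of 'consistency score').
    """
    assert judge_runs and all(len(r) == len(judge_runs[0]) for r in judge_runs)
    n_items = len(judge_runs[0])
    consistent_passes = sum(
        all(run[i] for run in judge_runs) for i in range(n_items)
    )
    return consistent_passes / n_items


# Example: k = 3 judge runs over a 4-item rubric.
runs = [
    [True, True, False, True],
    [True, True, False, False],
    [True, True, True, True],
]
print(cacs_at_k(runs))  # 0.5 -> only items 0 and 1 pass in every run
```

Under the paper's dual-judge framework, the k verdicts could come from the high-capability LLM-as-judge or from the distilled, locally deployable judge; an aggregation like the one above is agnostic to which model produced them.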
Related papers
- MedConsultBench: A Full-Cycle, Fine-Grained, Process-Aware Benchmark for Medical Consultation Agents [10.109613967215447]
We propose MedConsultBench, a comprehensive framework designed to evaluate the complete online consultation cycle. Our methodology introduces Atomic Information Units (AIUs) to track clinical information acquisition at a sub-turn level. By addressing the underspecification and ambiguity inherent in online consultations, the benchmark evaluates uncertainty-aware yet concise inquiry.
arXiv Detail & Related papers (2026-01-19T02:18:10Z)
- Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z)
- ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z)
- CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment [29.48544328813161]
We introduce CP-Env, a controllable agentic hospital environment designed to evaluate large language models (LLMs) across end-to-end clinical pathways. Following the adaptive flow of care in real hospitals, it enables branching, long-horizon task execution. Results reveal that most models struggle with pathway hallucinations, faltering as complexity grows and losing critical diagnostic details.
arXiv Detail & Related papers (2025-12-11T01:54:55Z)
- Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z)
- MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models [10.46932473088646]
We introduce MedGUIDE, a new benchmark for evaluating Large Language Models (LLMs) on their ability to make guideline-consistent clinical decisions. MedGUIDE is constructed from 55 curated NCCN decision trees across 17 cancer types. We apply a two-stage quality selection process, combining expert-labeled reward models and LLM-as-a-judge ensembles across ten clinical and linguistic criteria, to select 7,747 high-quality samples.
arXiv Detail & Related papers (2025-05-16T18:21:52Z)
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, an evaluation framework specifically designed for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation [31.061600616994145]
HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
arXiv Detail & Related papers (2025-01-12T07:30:49Z)
- Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm [15.627870862369784]
Large language models (LLMs) are attracting increasing interest as a way to improve clinical efficiency in medical diagnosis.
We propose an automatic evaluation paradigm tailored to assess the LLMs' capabilities in delivering clinical services.
arXiv Detail & Related papers (2024-03-25T06:17:54Z)