A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models
- URL: http://arxiv.org/abs/2512.02816v1
- Date: Tue, 02 Dec 2025 14:26:44 GMT
- Title: A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models
- Authors: Kunning Li, Jianbin Guo, Zhaoyang Shang, Yiqing Liu, Hongmin Du, Lingling Liu, Yuping Zhao, Lifeng Dong,
- Abstract summary: The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine domain presents an urgent need to assess their clinical application capabilities. Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, together with a specialized reward model employed to quantify prescription-syndrome congruence.
- Score: 2.5287456399381494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM's "Syndrome Differentiation and Treatment" (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks: TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.
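The abstract describes a benchmark with four tasks scored by three evaluation mechanisms. A minimal sketch of how such a framework might dispatch a task to its scoring mechanism is shown below; the function names, the task-to-mechanism mapping, and the placeholder scorers are illustrative assumptions, not the TCM-BEST4SDT implementation (in particular, a real judge model and reward model would call trained networks, not the toy heuristics here).

```python
# Hypothetical sketch of a three-mechanism evaluation dispatcher,
# loosely modeled on the framework described in the abstract.
# All names and the task-to-mechanism mapping are assumptions.

def selected_response_eval(prediction: str, answer: str) -> float:
    """Score objective items (e.g. multiple choice) by exact match."""
    return 1.0 if prediction.strip() == answer.strip() else 0.0

def judge_model_eval(prediction: str, reference: str) -> float:
    """Stand-in for an LLM judge grading free-form answers on [0, 1].
    Crude token overlap is used here purely as a placeholder."""
    pred, ref = set(prediction.split()), set(reference.split())
    return len(pred & ref) / max(len(ref), 1)

def reward_model_eval(prescription: str, syndrome: str) -> float:
    """Stand-in for the specialized reward model quantifying
    prescription-syndrome congruence; a real scorer would run
    the pair through a trained reward network."""
    return 0.0  # placeholder score

# Assumed mapping of the four benchmark tasks to mechanisms.
MECHANISMS = {
    "TCM Basic Knowledge": selected_response_eval,
    "Medical Ethics": selected_response_eval,
    "LLM Content Safety": judge_model_eval,
    "SDT": reward_model_eval,
}

def score(task: str, model_output: str, reference: str) -> float:
    """Route a model output to the mechanism assigned to its task."""
    return MECHANISMS[task](model_output, reference)
```

A harness like this would average `score` over all items per task to produce the per-task results reported for the 15 evaluated LLMs.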
Related papers
- LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning [27.37958097277936]
LingLan benchmark is a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation into single-choice decision recognition.
arXiv Detail & Related papers (2026-02-02T08:02:25Z) - MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z) - TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine [11.944521938566231]
Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally rich fields like Traditional Chinese Medicine (TCM) requires rigorous evaluation. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-seek), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT).
arXiv Detail & Related papers (2025-11-17T09:15:41Z) - TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine [51.01817637808011]
We introduce TCM-Eval, the first dynamic and high-quality benchmark for Traditional Chinese Medicine (TCM). We construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE). Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM.
arXiv Detail & Related papers (2025-11-10T14:35:25Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z) - ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine [53.91744478760689]
We present ShizhenGPT, the first multimodal language model tailored for Traditional Chinese Medicine (TCM). ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models.
arXiv Detail & Related papers (2025-08-20T13:30:20Z) - MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine [36.08458917280579]
MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance.
arXiv Detail & Related papers (2025-06-02T02:01:40Z) - TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine [22.179602943420907]
We introduce TCM-Ladder, the first comprehensive multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset covers multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. The dataset was constructed using a combination of automated and manual filtering processes and comprises over 52,000 questions.
arXiv Detail & Related papers (2025-05-29T23:13:57Z) - TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine [10.74071774496229]
Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. Results show a performance hierarchy: all models have limitations in specialized areas like Meridian & Acupoint theory and the various TCM schools, revealing gaps between current capabilities and clinical needs.
arXiv Detail & Related papers (2025-03-10T08:29:15Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.