TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine
- URL: http://arxiv.org/abs/2511.13169v1
- Date: Mon, 17 Nov 2025 09:15:41 GMT
- Title: TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine
- Authors: Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li, Bing Han, Wenchao Tang, Jie Xu, Ming Li
- Abstract summary: Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous evaluation. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the MedBench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.
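The permutation-based consistency testing described in the abstract can be pictured concretely: the same multiple-choice question is re-posed with its options shuffled, and a robust model should select the same option content every time. Below is a minimal, hypothetical sketch of that protocol, not the authors' actual evaluation harness; `ask_model`, `evaluate_consistency`, and the sample question are all illustrative assumptions, with the LLM call faked by a positionally biased stub.

```python
"""Sketch of permutation-based consistency testing for multiple-choice QA.
All names here are illustrative assumptions, not the TCM-5CEval harness."""
import itertools
import random

random.seed(0)  # reproducible demo runs

def ask_model(question: str, options: dict[str, str]) -> str:
    """Placeholder for a real LLM call; returns the chosen option label.
    This fake model is positionally biased: half the time it picks 'A'
    regardless of content, mimicking the fragility the paper reports."""
    if random.random() < 0.5:
        return "A"
    return max(options, key=lambda k: len(options[k]))  # crude content heuristic

def evaluate_consistency(question: str, options: dict[str, str],
                         answer_text: str, n_permutations: int = 6) -> tuple[float, bool]:
    """Re-ask the same question under shuffled option orderings.
    Returns (accuracy across permutations, whether the chosen content
    was identical every time). A robust model tracks content, not position."""
    labels = list(options)            # e.g. ["A", "B", "C", "D"]
    texts = list(options.values())
    chosen_texts = []
    for perm in itertools.islice(itertools.permutations(texts), n_permutations):
        shuffled = dict(zip(labels, perm))   # relabel permuted contents as A-D
        label = ask_model(question, shuffled)
        chosen_texts.append(shuffled[label]) # map label back to option content
    accuracy = sum(t == answer_text for t in chosen_texts) / len(chosen_texts)
    consistent = len(set(chosen_texts)) == 1
    return accuracy, consistent

if __name__ == "__main__":
    opts = {"A": "Wind-cold", "B": "Wind-heat", "C": "Damp-heat", "D": "Qi deficiency"}
    acc, stable = evaluate_consistency(
        "Which pattern matches aversion to cold with no sweating?", opts, "Wind-cold")
    print(f"permutation accuracy={acc:.2f}, consistent={stable}")
```

Under this kind of probe, a model whose raw accuracy looks high on the canonical option ordering can still score poorly once the orderings are permuted, which is the gap between recall and robust understanding that the benchmark's consistency testing is designed to expose.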
Related papers
- MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z) - A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models [2.5287456399381494]
The growing use of Large Language Models (LLMs) in the Traditional Chinese Medicine domain creates an urgent need to assess their clinical application capabilities. Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, together with a specialized reward model employed to quantify prescription-syndrome congruence.
arXiv Detail & Related papers (2025-12-02T14:26:44Z) - TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine [51.01817637808011]
We introduce TCM-Eval, the first dynamic and high-quality benchmark for Traditional Chinese Medicine (TCM). We construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE). Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM.
arXiv Detail & Related papers (2025-11-10T14:35:25Z) - Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z) - ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine [53.91744478760689]
We present ShizhenGPT, the first multimodal language model tailored for Traditional Chinese Medicine (TCM). ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models.
arXiv Detail & Related papers (2025-08-20T13:30:20Z) - MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine [36.08458917280579]
MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance.
arXiv Detail & Related papers (2025-06-02T02:01:40Z) - Tianyi: A Traditional Chinese Medicine all-rounder language model and its Real-World Clinical Practice [15.020917068333237]
Tianyi is designed to assimilate interconnected and systematic TCM knowledge through progressive learning. Extensive evaluations demonstrate the significant potential of Tianyi as an AI assistant in TCM clinical practice and research.
arXiv Detail & Related papers (2025-05-19T14:17:37Z) - TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine [10.74071774496229]
Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. Results show a performance hierarchy: all models have limitations in specialized areas such as Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs.
arXiv Detail & Related papers (2025-03-10T08:29:15Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine [19.680694337954133]
Professional evaluation benchmarks for large language models (LLMs) have yet to cover the traditional Chinese medicine (TCM) domain.
To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM.
It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis.
To evaluate LLMs beyond the accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions.
arXiv Detail & Related papers (2024-06-03T09:11:13Z) - A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities. Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties. We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)