TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine
- URL: http://arxiv.org/abs/2503.07041v1
- Date: Mon, 10 Mar 2025 08:29:15 GMT
- Title: TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine
- Authors: Tianai Huang, Lu Lu, Jiayuan Chen, Lihao Liu, Junjun He, Yuping Zhao, Wenchao Tang, Jie Xu
- Abstract summary: Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs.
- Score: 10.74071774496229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PULSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on MedBench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.
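As a concrete illustration of how such a triaxial multiple-choice benchmark can be scored per dimension, here is a minimal sketch. The item schema and the model_answer() stub are assumptions for illustration, not TCM-3CEval's actual data format or API.

```python
# Hypothetical scoring harness for a TCM-3CEval-style benchmark: the
# item schema and model_answer() stub are illustrative assumptions.
from collections import defaultdict

items = [
    {"dimension": "core_knowledge",  # also: "classical_texts", "clinical_decision"
     "question": "Which meridian forms an interior-exterior pair with the Lung meridian?",
     "options": {"A": "Large Intestine", "B": "Heart", "C": "Spleen", "D": "Kidney"},
     "answer": "A"},
    # ... more items covering all three dimensions ...
]

def model_answer(question: str, options: dict) -> str:
    """Stub: send the question to an LLM and parse out a single option letter."""
    return "A"  # replace with a real model call

correct, total = defaultdict(int), defaultdict(int)
for item in items:
    dim = item["dimension"]
    total[dim] += 1
    if model_answer(item["question"], item["options"]) == item["answer"]:
        correct[dim] += 1

for dim, n in total.items():
    print(f"{dim}: accuracy = {correct[dim] / n:.3f}")
```

Reporting accuracy separately per axis, rather than one pooled score, is what exposes the subdomain gaps (e.g., Meridian & Acupoint theory) the abstract describes.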
Related papers
- BianCang: A Traditional Chinese Medicine Large Language Model [22.582027277167047]
BianCang is a TCM-specific large language model (LLM) that first injects domain-specific knowledge and then aligns it through targeted stimulation.
We constructed pre-training corpora, instruction-aligned datasets based on real hospital records, and the ChP-TCM dataset derived from the Pharmacopoeia of the People's Republic of China.
We compiled extensive TCM and medical corpora for continuous pre-training and supervised fine-tuning, building a comprehensive dataset to refine the model's understanding of TCM (a minimal two-stage training sketch follows this entry).
arXiv Detail & Related papers (2024-11-17T10:17:01Z)
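A minimal sketch, assuming a generic HuggingFace stack, of the two-stage recipe described above: continued pre-training on raw TCM text, then supervised fine-tuning on instruction/response pairs. The base model name, file names, and field names are placeholders, not BianCang's released artifacts.

```python
# Two-stage sketch: continued pre-training, then supervised fine-tuning.
# Base model, file names, and fields are placeholders, not BianCang's.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2-0.5B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)
collator = DataCollatorForLanguageModeling(tok, mlm=False)  # labels = input_ids

# Stage 1: continued pre-training on raw TCM corpora (plain text).
corpus = load_dataset("text", data_files={"train": "tcm_corpus.txt"})["train"]
corpus = corpus.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                    batched=True, remove_columns=["text"])
Trainer(model=model, args=TrainingArguments("cpt_out", num_train_epochs=1),
        train_dataset=corpus, data_collator=collator).train()

# Stage 2: supervised fine-tuning on instruction/response pairs
# (e.g., hospital records rewritten as instruction -> response text).
sft = load_dataset("json", data_files={"train": "tcm_sft.jsonl"})["train"]
sft = sft.map(lambda b: tok([q + "\n" + a for q, a in zip(b["instruction"], b["output"])],
                            truncation=True, max_length=1024),
              batched=True, remove_columns=["instruction", "output"])
Trainer(model=model, args=TrainingArguments("sft_out", num_train_epochs=1),
        train_dataset=sft, data_collator=collator).train()
```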
- Intelligent Understanding of Large Language Models in Traditional Chinese Medicine Based on Prompt Engineering Framework [3.990633038739491]
We propose TCM-Prompt, a framework that integrates various pre-trained language models (PLMs), templates, tokenization, and verbalization methods.
We conducted experiments on disease classification, syndrome identification, herbal medicine recommendation, and general NLP tasks (a template-and-verbalizer sketch follows this entry).
arXiv Detail & Related papers (2024-10-25T10:24:30Z)
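To make the "templates and verbalization" idea concrete, here is a minimal sketch of prompt-based syndrome classification with a masked language model: a template wraps the clinical observation, and a verbalizer maps candidate label characters to classes. The template, label words, and choice of bert-base-chinese are assumptions for illustration, not TCM-Prompt's actual configuration.

```python
# Illustrative template-and-verbalizer setup; the template, label words,
# and bert-base-chinese are assumptions, not TCM-Prompt's configuration.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")

# Template: wrap the observation so the PLM predicts a label token at [MASK].
template = "患者症见：{obs}。综合判断，该患者属于[MASK]证。"
# Verbalizer: map single-character label words to syndrome classes.
verbalizer = {"寒": "cold syndrome", "热": "heat syndrome"}

obs = "畏寒肢冷，口淡不渴，喜热饮"  # aversion to cold, no thirst, prefers hot drinks
preds = fill(template.format(obs=obs), targets=list(verbalizer))
best = max(preds, key=lambda p: p["score"])
print(verbalizer[best["token_str"]], best["score"])
```

Restricting the fill-mask prediction to the verbalizer's label words turns the PLM's token probabilities directly into a classifier, which is the core move in template-based prompt engineering.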
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
A companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine [19.680694337954133]
Professional evaluation benchmarks for large language models (LLMs) have yet to cover the traditional Chinese medicine (TCM) domain.
To address this research gap, we introduce TCMBench, a comprehensive benchmark for evaluating LLM performance in TCM.
It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis.
To evaluate LLMs beyond the accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions (an illustrative overlap-scoring sketch follows this entry).
arXiv Detail & Related papers (2024-06-03T09:11:13Z)
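TCMScore's exact formulation is not given in this summary; as a stand-in, the sketch below computes a character-level F1 between a generated answer and a reference analysis, the kind of surface-overlap signal that answer-quality metrics for unsegmented Chinese text often build on. This is an illustration, not the paper's metric.

```python
# Stand-in answer-quality signal: character-overlap F1 between a model
# answer and a reference analysis. Not TCMScore's actual formulation.
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1, suitable for unsegmented Chinese text."""
    pred, ref = Counter(prediction), Counter(reference)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 8 of 10 characters overlap in each direction -> F1 = 0.8
print(char_f1("麻黄汤主治风寒表实证", "麻黄汤用于风寒表实证"))
```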
- Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge [0.0]
We present a TCM question dataset named TCM-QA, which comprises three question types: single choice, multiple choice, and true or false.
In our study, we evaluate the LLM under zero-shot and few-shot settings, while also discussing the differences between English and Chinese prompts.
Our results indicate that ChatGPT performs best on true-or-false questions, achieving its highest precision (0.688) there and its lowest (0.241) on multiple-choice questions (a zero-/few-shot prompt sketch follows this entry).
arXiv Detail & Related papers (2024-03-14T08:20:40Z)
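A minimal sketch of how the zero-shot and few-shot settings differ in practice: the few-shot prompt simply prepends worked exemplars before the target question. The question and exemplars are invented for illustration, not items from TCM-QA.

```python
# Zero-shot vs. few-shot prompt construction; question and exemplars
# are invented for illustration, not TCM-QA items.
question = "True or false: the Lung and the Large Intestine are an interior-exterior pair."

zero_shot = f"Answer the following TCM question.\n{question}\nAnswer:"

few_shot_examples = [
    ("True or false: the Heart opens into the tongue.", "True"),
    ("True or false: the Kidney governs the free flow of qi.", "False"),
]
demos = "\n".join(f"{q}\nAnswer: {a}" for q, a in few_shot_examples)
few_shot = f"Answer the following TCM questions.\n{demos}\n{question}\nAnswer:"

print(zero_shot)
print("---")
print(few_shot)
```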
- A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities. Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties. We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
- MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
- CMB: A Comprehensive Medical Benchmark in Chinese [67.69800156990952]
We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese.
While traditional Chinese medicine is integral to this evaluation, it does not constitute the benchmark's entirety.
We have evaluated several prominent LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
arXiv Detail & Related papers (2023-08-17T07:51:23Z)
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)