MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models
- URL: http://arxiv.org/abs/2312.12806v1
- Date: Wed, 20 Dec 2023 07:01:49 GMT
- Title: MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models
- Authors: Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang,
Liang He
- Abstract summary: We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, finding that Chinese medical LLMs underperform on the benchmark while several general-domain LLMs already possess considerable medical knowledge.
- Score: 56.36916128631784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of various medical large language models (LLMs) in the medical
domain has highlighted the need for unified evaluation standards, as manual
evaluation of LLMs proves to be time-consuming and labor-intensive. To address
this issue, we introduce MedBench, a comprehensive benchmark for the Chinese
medical domain, comprising 40,041 questions sourced from authentic examination
exercises and medical reports of diverse branches of medicine. In particular,
this benchmark is composed of four key components: the Chinese Medical
Licensing Examination, the Resident Standardization Training Examination, the
Doctor In-Charge Qualification Examination, and real-world clinic cases
encompassing examinations, diagnoses, and treatments. MedBench replicates the
educational progression and clinical practice experiences of doctors in
Mainland China, thereby establishing itself as a credible benchmark for
assessing the mastery of knowledge and reasoning abilities of medical
LLMs. We perform extensive experiments and conduct an in-depth
analysis from diverse perspectives, which culminate in the following findings:
(1) Chinese medical LLMs underperform on this benchmark, highlighting the need
for significant advances in clinical knowledge and diagnostic precision. (2)
Several general-domain LLMs surprisingly possess considerable medical
knowledge. These findings elucidate both the capabilities and limitations of
LLMs within the context of MedBench, with the ultimate goal of aiding the
medical research community.
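
As a concrete illustration of the evaluation setting described above, below is a minimal Python sketch of scoring an LLM on MedBench-style multiple-choice items, grouped by benchmark component. The item layout and the ask_model stub are hypothetical stand-ins assumed for illustration, not the authors' released pipeline.

# A minimal sketch of scoring an LLM on MedBench-style multiple-choice
# items. The Item layout and the ask_model stub are assumptions for
# illustration; they are not the paper's released evaluation code.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Item:
    component: str   # e.g. "Licensing Exam", "Clinic Cases"
    question: str
    options: dict    # option label -> option text, e.g. {"A": "Liver", ...}
    answer: str      # gold option label, e.g. "B"

def ask_model(item):
    """Stand-in for a real LLM call; returns a predicted option label."""
    return "A"  # placeholder prediction

def accuracy_by_component(items):
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        total[it.component] += 1
        if ask_model(it) == it.answer:
            correct[it.component] += 1
    return {c: correct[c] / total[c] for c in total}

demo = [Item("Licensing Exam", "Which organ secretes insulin?",
             {"A": "Liver", "B": "Pancreas", "C": "Spleen"}, "B")]
print(accuracy_by_component(demo))  # {'Licensing Exam': 0.0} with the stub

Replacing the ask_model stub with a real model call yields per-component accuracies of the kind the paper's analysis compares across LLMs.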
Related papers
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs.
First, MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties.
Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering (a minimal sketch of one such mechanism follows this list).
arXiv Detail & Related papers (2024-06-24T02:25:48Z)
- MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
Large language models (LLMs) have excelled across domains, delivering notable performance on medical evaluation benchmarks.
However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios.
We develop MultifacetEval, a novel evaluation framework for examining how deeply and how broadly LLMs encode and master medical knowledge.
arXiv Detail & Related papers (2024-06-05T04:15:07Z)
- MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding [48.348511646407026]
We introduce the Medical dialogue with Knowledge enhancement and clinical Pathway encoding (MedKP) framework.
The framework integrates an external knowledge enhancement module through a medical knowledge graph and an internal clinical pathway encoding via medical entities and physician actions.
arXiv Detail & Related papers (2024-03-11T10:57:45Z)
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
- PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain [24.411904114158673]
We rebuild the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, PromptCBLUE.
Our benchmark is a suitable test bed and an online platform for evaluating the multi-task capabilities of Chinese LLMs on a wide range of biomedical tasks.
arXiv Detail & Related papers (2023-10-22T02:20:38Z) - CMB: A Comprehensive Medical Benchmark in Chinese [67.69800156990952]
We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese.
Traditional Chinese medicine is integral to this evaluation, but it does not constitute the benchmark's entirety.
We have evaluated several prominent LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
arXiv Detail & Related papers (2023-08-17T07:51:23Z)
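
The "dynamic evaluation mechanisms" mentioned in the MedBench (2024-06) entry above are not detailed here; one common technique for defeating answer remembering, however, is to shuffle the option order on every run and remap the gold label. The sketch below is purely illustrative and assumes the option-dictionary layout of the earlier example.

# An illustrative (not MedBench's actual) anti-memorization mechanism:
# shuffle the option order each run so a model cannot score by recalling
# fixed answer letters, and remap the gold label accordingly.
import random

def shuffle_options(options, answer, rng):
    """Return the options relabeled in a random order plus the new gold label."""
    texts = list(options.values())
    gold_text = options[answer]
    rng.shuffle(texts)
    labels = [chr(ord("A") + i) for i in range(len(texts))]
    relabeled = dict(zip(labels, texts))
    return relabeled, labels[texts.index(gold_text)]

rng = random.Random(0)
opts = {"A": "Liver", "B": "Pancreas", "C": "Spleen"}
print(shuffle_options(opts, "B", rng))  # the gold label moves with its text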