Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese
Medical Exam Dataset
- URL: http://arxiv.org/abs/2306.03030v3
- Date: Mon, 23 Oct 2023 02:55:08 GMT
- Title: Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese
Medical Exam Dataset
- Authors: Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian,
Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li
- Abstract summary: We introduce CMExam, sourced from the Chinese National Medical Licensing Examination.
CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner.
For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels.
- Score: 31.047827145874844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large language models (LLMs) have transformed the
field of question answering (QA). However, evaluating LLMs in the medical field
is challenging due to the lack of standardized and comprehensive datasets. To
address this gap, we introduce CMExam, sourced from the Chinese National
Medical Licensing Examination. CMExam consists of 60K+ multiple-choice
questions for standardized and objective evaluations, as well as solution
explanations for model reasoning evaluation in an open-ended manner. For
in-depth analyses of LLMs, we invited medical professionals to label five
additional question-wise annotations, including disease groups, clinical
departments, medical disciplines, areas of competency, and question difficulty
levels. Alongside the dataset, we further conducted thorough experiments with
representative LLMs and QA algorithms on CMExam. The results show that GPT-4
had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results
highlight a great disparity when compared to human accuracy, which stood at
71.6%. For explanation tasks, while LLMs could generate relevant reasoning and
demonstrate improved performance after finetuning, they fall short of a desired
standard, indicating ample room for improvement. To the best of our knowledge,
CMExam is the first Chinese medical exam dataset to provide comprehensive
medical annotations. The experiments and findings of LLM evaluation also
provide valuable insights into the challenges and potential solutions in
developing Chinese medical QA systems and LLM evaluation pipelines. The dataset
and relevant code are available at https://github.com/williamliujl/CMExam.
Related papers
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - MedExQA: Medical Question Answering Benchmark with Multiple Explanations [2.2246416434538308]
This paper introduces MedExQA, a novel benchmark in medical question-answering to evaluate large language models' (LLMs) understanding of medical knowledge through explanations.
By constructing datasets across five distinct medical specialties, we address a major gap in current medical QA benchmarks.
Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology.
arXiv Detail & Related papers (2024-06-10T14:47:04Z) - TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models [22.76485170022542]
We introduce a new medical question-answering (QA) dataset that contains massive manual instruction for solving Traditional Chinese Medicine examination tasks.
Our TCMD collects massive questions across diverse domains with their annotated medical subjects.
arXiv Detail & Related papers (2024-06-07T13:48:15Z) - Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassed data from more than 50,00 peer-reviewed articles) and created the "EBMQA"
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
arXiv Detail & Related papers (2024-06-06T08:41:46Z) - Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data [3.471944921180245]
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.