MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large
Language Models in Medicine
- URL: http://arxiv.org/abs/2305.07340v1
- Date: Fri, 12 May 2023 09:37:13 GMT
- Title: MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large
Language Models in Medicine
- Authors: Jie Xu, Lu Lu, Sen Yang, Bilin Liang, Xinwei Peng, Jiali Pang, Jinru
Ding, Xiaoming Shi, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang
- Abstract summary: A set of evaluation criteria is designed based on a comprehensive literature review.
Existing candidate criteria are optimized using a Delphi method by five experts in medicine and engineering.
Three chatbots are evaluated: ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory.
- Score: 16.75133391080187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: METHODS: First, a set of evaluation criteria is designed based on a
comprehensive literature review. Second, existing candidate criteria are
optimized using a Delphi method by five experts in medicine and
engineering. Third, three clinical experts design a set of medical datasets to
interact with LLMs. Finally, benchmarking experiments are conducted on the
datasets. The responses generated by chatbots based on LLMs are recorded for
blind evaluations by five licensed medical experts. RESULTS: The obtained
evaluation criteria cover medical professional capabilities, social
comprehensive capabilities, contextual capabilities, and computational
robustness, with sixteen detailed indicators. The medical datasets include
twenty-seven medical dialogues and seven case reports in Chinese. Three
chatbots are evaluated: ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor
PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory. Experimental
results show that Dr. PJ outperforms ChatGPT and ERNIE Bot in both
multiple-turn medical dialogue and case report scenarios.
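As a rough illustration of the scoring setup described above (five blinded raters, multiple indicators, three chatbots), the sketch below aggregates hypothetical expert ratings. The indicator names, the 1-5 scale, and all numbers are invented for demonstration; they are not MedGPTEval's actual sixteen indicators or results.

```python
# Illustrative aggregation of blinded expert ratings for a chatbot benchmark.
# The 1-5 scale, indicator names, and all scores are invented placeholders.
from statistics import mean

# ratings[chatbot][indicator] -> scores from the five blinded raters
ratings = {
    "ChatGPT":   {"medical_accuracy": [4, 4, 3, 4, 5], "empathy": [3, 3, 4, 3, 3]},
    "ERNIE Bot": {"medical_accuracy": [3, 4, 3, 3, 4], "empathy": [3, 4, 3, 3, 3]},
    "Dr. PJ":    {"medical_accuracy": [5, 4, 4, 5, 4], "empathy": [4, 4, 4, 3, 4]},
}

def summarize(ratings):
    """Average each indicator over raters, then average indicators per chatbot."""
    summary = {}
    for bot, indicators in ratings.items():
        per_indicator = {name: mean(scores) for name, scores in indicators.items()}
        summary[bot] = (mean(per_indicator.values()), per_indicator)
    return summary

# Rank chatbots by their overall mean score.
for bot, (overall, per_indicator) in sorted(
        summarize(ratings).items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{bot}: overall={overall:.2f} {per_indicator}")
```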
Related papers
- Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" benchmark.
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
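A minimal sketch of the evaluation loop such a benchmark implies: query each model on every question and break accuracy down by question type. The ask_model stub and the question-record fields are assumptions for illustration, not EBMQA's actual harness.

```python
# Hypothetical benchmarking loop: accuracy per model, split by question type
# ("numerical" vs. "semantic"). ask_model() is a stand-in for real API calls;
# the question-record fields are assumed for illustration.
from collections import defaultdict

def ask_model(model: str, question: str) -> str:
    raise NotImplementedError("wire up the real model API here")

def benchmark(models, questions):
    """questions: dicts like {"text": ..., "answer": ..., "qtype": "numerical"}."""
    hits = defaultdict(int)    # (model, qtype) -> correct answers
    totals = defaultdict(int)  # (model, qtype) -> questions asked
    for q in questions:
        for model in models:
            prediction = ask_model(model, q["text"])
            totals[(model, q["qtype"])] += 1
            hits[(model, q["qtype"])] += int(prediction.strip() == q["answer"])
    return {key: hits[key] / totals[key] for key in totals}
```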
arXiv Detail & Related papers (2024-06-06T08:41:46Z)
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
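A hedged sketch of the interaction pattern such a simulator implies: a doctor agent (the LLM under test) interrogates NPC roles over several turns before committing to a diagnosis. The role names, turn protocol, and chat stub are assumptions, not AI Hospital's actual API.

```python
# Illustrative multi-agent consultation loop. The role names, turn limit,
# DIAGNOSIS convention, and chat() stub are assumptions for demonstration.
def chat(role: str, history: list[str]) -> str:
    raise NotImplementedError("back each role with an LLM or a scripted NPC")

def consult(max_turns: int = 5) -> str:
    history = ["patient: describes the chief complaint"]
    for _ in range(max_turns):
        doctor_msg = chat("doctor", history)      # the LLM under evaluation
        history.append(f"doctor: {doctor_msg}")
        if doctor_msg.startswith("DIAGNOSIS:"):   # doctor commits to an answer
            break
        # Route the doctor's query to the appropriate NPC.
        npc = "examiner" if "order test" in doctor_msg.lower() else "patient"
        history.append(f"{npc}: {chat(npc, history)}")
    return history[-1]
```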
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, culminating in several findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
- DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation [38.998563718476525]
We propose a set of metrics to evaluate the completeness, conciseness, and attribution of generated medical text.
The metrics can be computed by various types of evaluators including instruction-following (both proprietary and open-source) and supervised entailment models.
A comprehensive human study shows that DocLens exhibits substantially higher agreement with the judgments of medical experts than existing metrics.
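A rough sketch of claim-level scoring in this spirit: the entails predicate below stands in for an entailment model or an instruction-following judge, and the two metric definitions are simplified assumptions rather than DocLens's formal ones.

```python
# Simplified claim-level completeness/conciseness metrics. entails() is a
# placeholder for an NLI model or LLM judge; the definitions are illustrative
# simplifications, not DocLens's own formulations.
def entails(premise: str, hypothesis: str) -> bool:
    raise NotImplementedError("plug in an entailment model or LLM judge")

def completeness(generated: str, reference_claims: list[str]) -> float:
    """Fraction of reference claims supported by the generated text."""
    supported = sum(entails(generated, claim) for claim in reference_claims)
    return supported / len(reference_claims) if reference_claims else 1.0

def conciseness(generated_claims: list[str], reference: str) -> float:
    """Fraction of generated claims actually grounded in the reference."""
    grounded = sum(entails(reference, claim) for claim in generated_claims)
    return grounded / len(generated_claims) if generated_claims else 1.0
```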
arXiv Detail & Related papers (2023-11-16T05:32:09Z)
- Integrating UMLS Knowledge into Large Language Models for Medical Question Answering [18.06960842747575]
Large language models (LLMs) have demonstrated powerful text generation capabilities, bringing unprecedented innovation to the healthcare field.
We develop an augmented LLM framework based on the Unified Medical Language System (UMLS), aiming to better serve the healthcare community.
We employ LLaMa2-13b-chat and ChatGPT-3.5 as our benchmark models, and conduct automatic evaluations using the ROUGE Score and BERTScore on 104 questions from the LiveQA test set.
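The automatic-evaluation step can be approximated with the standard rouge-score and bert-score Python packages; the answer strings below are placeholders, and this is a minimal sketch rather than the paper's exact pipeline or settings.

```python
# Minimal ROUGE-L and BERTScore evaluation sketch
# (pip install rouge-score bert-score). The strings are placeholders.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Aspirin can irritate the stomach lining and cause ulcers."
candidate = "Aspirin may damage the stomach lining, leading to ulcers."

# ROUGE-L F1 between a reference answer and a model answer.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore over (possibly many) candidate/reference pairs.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}  BERTScore F1: {f1.mean().item():.3f}")
```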
arXiv Detail & Related papers (2023-10-04T12:50:26Z)
- A Benchmark for Automatic Medical Consultation System: Frameworks, Tasks and Datasets [70.32630628211803]
We propose two frameworks to support automatic medical consultation, namely doctor-patient dialogue understanding and task-oriented interaction.
A new large medical dialogue dataset with multi-level fine-grained annotations is introduced.
We report a set of benchmark results for each task, which shows the usability of the dataset and sets a baseline for future studies.
arXiv Detail & Related papers (2022-04-19T16:43:21Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
- MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation [86.38736781043109]
We build and release MedDG, a large-scale, high-quality medical dialogue dataset covering 12 types of common gastrointestinal diseases.
We propose two medical dialogue tasks based on the MedDG dataset: next-entity prediction and doctor response generation.
Experimental results show that pre-trained language models and other baselines struggle on both tasks, performing poorly on our dataset.
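As a toy illustration of scoring the next-entity-prediction task, one can compare the predicted entity set for the doctor's next turn against the gold annotation with set-level F1; the data layout and entity names below are assumptions, not MedDG's actual format.

```python
# Toy set-level F1 for next-entity prediction. The entity names are invented
# examples; MedDG's real annotation schema may differ.
def entity_f1(predicted: set[str], gold: set[str]) -> float:
    if not predicted and not gold:
        return 1.0  # both empty: trivially perfect
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(entity_f1({"gastritis", "nausea"}, {"gastritis", "vomiting"}))  # 0.5
```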
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.