MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large
Language Models in Medicine
- URL: http://arxiv.org/abs/2305.07340v1
- Date: Fri, 12 May 2023 09:37:13 GMT
- Title: MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large
Language Models in Medicine
- Authors: Jie Xu, Lu Lu, Sen Yang, Bilin Liang, Xinwei Peng, Jiali Pang, Jinru
Ding, Xiaoming Shi, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang
- Abstract summary: A set of evaluation criteria is designed based on a comprehensive literature review.
Existing candidate criteria are optimized using a Delphi method by five experts in medicine and engineering.
Three chatbots are evaluated, ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory.
- Score: 16.75133391080187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: METHODS: First, a set of evaluation criteria is designed based on a
comprehensive literature review. Second, existing candidate criteria are
optimized using a Delphi method by five experts in medicine and
engineering. Third, three clinical experts design a set of medical datasets to
interact with LLMs. Finally, benchmarking experiments are conducted on the
datasets. The responses generated by chatbots based on LLMs are recorded for
blind evaluations by five licensed medical experts. RESULTS: The obtained
evaluation criteria cover medical professional capabilities, social
comprehensive capabilities, contextual capabilities, and computational
robustness, with sixteen detailed indicators. The medical datasets include
twenty-seven medical dialogues and seven case reports in Chinese. Three
chatbots are evaluated, ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor
PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory. Experimental
results show that Dr. PJ outperforms ChatGPT and ERNIE Bot in both
multiple-turn medical dialogue and case report scenarios.
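The abstract describes, but does not include code for, the blind-evaluation step: each of the five licensed experts scores a chatbot response against the sixteen indicators, and the scores are summarized per indicator and per capability category. The sketch below illustrates one way such ratings could be aggregated; only the four category names come from the abstract, while the indicator names, the 1-5 scale, and the helper function are hypothetical assumptions rather than the authors' implementation.
```python
from statistics import mean

# Category names follow the abstract; the indicator names, the 1-5 scale,
# and the aggregation scheme are illustrative assumptions, not the paper's code.
CATEGORIES = {
    "medical professional capabilities": ["accuracy", "informativeness"],
    "social comprehensive capabilities": ["empathy", "social_decorum"],
    "contextual capabilities": ["context_relevance", "multi_turn_consistency"],
    "computational robustness": ["typo_robustness", "prompt_robustness"],
}

def aggregate(expert_ratings):
    """expert_ratings: one dict per blinded expert, mapping indicator -> score.
    Returns mean scores per indicator and per capability category."""
    per_indicator = {
        ind: mean(r[ind] for r in expert_ratings)
        for inds in CATEGORIES.values()
        for ind in inds
    }
    per_category = {
        cat: mean(per_indicator[ind] for ind in inds)
        for cat, inds in CATEGORIES.items()
    }
    return per_indicator, per_category

if __name__ == "__main__":
    # Five hypothetical experts scoring one chatbot response on a 1-5 scale.
    experts = [
        {ind: 4 for inds in CATEGORIES.values() for ind in inds}
        for _ in range(5)
    ]
    _, by_category = aggregate(experts)
    for cat, score in by_category.items():
        print(f"{cat}: {score:.2f}")
```
Averaging per category mirrors how the abstract groups the sixteen indicators into four capability areas; the paper itself may weight or rank indicators differently.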
Related papers
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [38.02853540388593]
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark [0.29987253996125257]
Large Vision-Language Models (LVLMs) are being explored for applications in telemedicine, yet their ability to engage with diverse patient behaviors remains underexplored.
We introduce 3MDBench, an open-source evaluation framework designed to assess LLM-driven medical consultations.
The benchmark integrates textual and image-based patient data across 34 common diagnoses, mirroring real-world telemedicine interactions.
arXiv Detail & Related papers (2025-03-26T07:32:05Z) - Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation [0.0]
Large language models (LLMs) have shown impressive capabilities in natural language processing tasks, including dialogue generation.
This research aims to conduct a novel comparative analysis of two prominent techniques, fine-tuning with LoRA and the Retrieval-Augmented Generation framework.
arXiv Detail & Related papers (2025-02-04T11:50:40Z) - MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [20.83722922095852]
MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. Its MM subset introduces expert-level exam questions with diverse images and rich clinical information. We evaluate 18 leading models on the benchmark.
arXiv Detail & Related papers (2025-01-30T14:07:56Z) - RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules.
We develop a medical dialogue dataset comprising rule-based communications between patients and physicians.
Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
A companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multi-perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice.
We used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA" dataset.
We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus.
We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
arXiv Detail & Related papers (2024-06-06T08:41:46Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between the Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation [37.58514130165496]
We propose a set of metrics to evaluate the completeness, conciseness, and attribution of generated medical text.
The metrics can be computed by various types of evaluators including instruction-following (both proprietary and open-source) and supervised entailment models.
A comprehensive human study shows that DocLens exhibits substantially higher agreement with the judgments of medical experts than existing metrics.
arXiv Detail & Related papers (2023-11-16T05:32:09Z) - Integrating UMLS Knowledge into Large Language Models for Medical
Question Answering [18.06960842747575]
Large language models (LLMs) have demonstrated powerful text generation capabilities, bringing unprecedented innovation to the healthcare field.
We develop an augmented LLM framework based on the Unified Medical Language System (UMLS), aiming to better serve the healthcare community.
We employ LLaMa2-13b-chat and ChatGPT-3.5 as our benchmark models, and conduct automatic evaluations using the ROUGE Score and BERTScore on 104 questions from the LiveQA test set.
arXiv Detail & Related papers (2023-10-04T12:50:26Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z) - MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware
Medical Dialogue Generation [86.38736781043109]
We build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG.
We propose two kinds of medical dialogue tasks based on the MedDG dataset: next entity prediction and doctor response generation.
Experimental results show that pre-trained language models and other baselines struggle on both tasks, achieving poor performance on our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)