Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models
- URL: http://arxiv.org/abs/2402.11217v1
- Date: Sat, 17 Feb 2024 08:04:23 GMT
- Title: Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models
- Authors: Wenxuan Wang, Yihang Su, Jingyuan Huan, Jie Liu, Wenting Chen, Yudi
Zhang, Cheng-Yi Li, Kao-Jung Chang, Xiaohan Xin, Linlin Shen, Michael R. Lyu
- Abstract summary: We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
- Score: 59.60384461302662
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The significant breakthroughs of Medical Multi-Modal Large Language Models
(Med-MLLMs) are reshaping modern healthcare with robust information synthesis and
medical decision support. However, these models are often evaluated on
benchmarks that are unsuitable for Med-MLLMs due to the intricate nature of
the real-world diagnostic frameworks, which encompass diverse medical
specialties and involve complex clinical decisions. Moreover, these benchmarks
are susceptible to data leakage, since Med-MLLMs are trained on large
assemblies of publicly available data. Thus, an isolated and clinically
representative benchmark is highly desirable for credible Med-MLLM evaluation.
To this end, we introduce Asclepius, a novel Med-MLLM benchmark that rigorously
and comprehensively assesses model capability in terms of: distinct medical
specialties (cardiovascular, gastroenterology, etc.) and different diagnostic
capacities (perception, disease analysis, etc.). Grounded in 3 proposed core
principles, Asclepius ensures a comprehensive evaluation by encompassing 15
medical specialties, stratified into 3 main categories and 8 sub-categories of
clinical tasks, and remaining free from train-validate contamination. We further
provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human
specialists, providing insights into their competencies and limitations in
various medical contexts. Our work not only advances the understanding of
Med-MLLMs' capabilities but also sets a precedent for future evaluations and
the safe deployment of these models in clinical environments. We launch and
maintain a leaderboard for community assessment of Med-MLLM capabilities
(https://asclepius-med.github.io/).
Related papers
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The accompanying MedS-Ins dataset comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM.
First, MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties.
Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering.
arXiv Detail & Related papers (2024-06-24T02:25:48Z) - MedExQA: Medical Question Answering Benchmark with Multiple Explanations [2.2246416434538308]
This paper introduces MedExQA, a novel benchmark in medical question-answering to evaluate large language models' (LLMs) understanding of medical knowledge through explanations.
By constructing datasets across five distinct medical specialties, we address a major gap in current medical QA benchmarks.
Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology.
arXiv Detail & Related papers (2024-06-10T14:47:04Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor (the player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of several key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in several key findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)