Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models
- URL: http://arxiv.org/abs/2402.11217v1
- Date: Sat, 17 Feb 2024 08:04:23 GMT
- Title: Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models
- Authors: Wenxuan Wang, Yihang Su, Jingyuan Huan, Jie Liu, Wenting Chen, Yudi
Zhang, Cheng-Yi Li, Kao-Jung Chang, Xiaohan Xin, Linlin Shen, Michael R. Lyu
- Abstract summary: We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
- Score: 59.60384461302662
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The significant breakthroughs of Medical Multi-Modal Large Language Models
(Med-MLLMs) renovate modern healthcare with robust information synthesis and
medical decision support. However, these models are often evaluated on
benchmarks unsuitable for Med-MLLMs, owing to the intricate nature of
real-world diagnostic frameworks, which encompass diverse medical
specialties and involve complex clinical decisions. Moreover, these benchmarks
are susceptible to data leakage, since Med-MLLMs are trained on large
assemblies of publicly available data. Thus, an isolated and clinically
representative benchmark is highly desirable for credible Med-MLLM evaluation.
To this end, we introduce Asclepius, a novel Med-MLLM benchmark that rigorously
and comprehensively assesses model capability in terms of: distinct medical
specialties (cardiovascular, gastroenterology, etc.) and different diagnostic
capacities (perception, disease analysis, etc.). Grounded in 3 proposed core
principles, Asclepius ensures a comprehensive evaluation: it encompasses 15
medical specialties, stratifies clinical tasks into 3 main categories and 8
sub-categories, and remains free from train-validate contamination. We further
provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human
specialists, providing insights into their competencies and limitations in
various medical contexts. Our work not only advances the understanding of
Med-MLLMs' capabilities but also sets a precedent for future evaluations and
the safe deployment of these models in clinical environments. We launch and
maintain a leaderboard for community assessment of Med-MLLM capabilities
(https://asclepius-med.github.io/).
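Since the abstract frames evaluation along two axes at once (medical specialty and diagnostic capacity), a minimal sketch may help make the "spectrum" reporting concrete. The following is not the authors' released code; the item schema, field names, and exact-match scoring are illustrative assumptions.

```python
# A minimal sketch (not the authors' released harness) of aggregating one
# model's results per medical specialty and per diagnostic capacity.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    specialty: str   # e.g. "cardiovascular", "gastroenterology" (assumed labels)
    capacity: str    # e.g. "perception", "disease analysis" (assumed labels)
    answer: str      # gold label
    prediction: str  # model output, already normalized

def spectrum_scores(items: list[BenchmarkItem]) -> dict[str, dict[str, float]]:
    """Return accuracy broken down along each evaluation axis."""
    hits: dict[str, dict[str, int]] = {"specialty": defaultdict(int), "capacity": defaultdict(int)}
    totals: dict[str, dict[str, int]] = {"specialty": defaultdict(int), "capacity": defaultdict(int)}
    for item in items:
        correct = int(item.prediction == item.answer)
        for axis, key in (("specialty", item.specialty), ("capacity", item.capacity)):
            hits[axis][key] += correct
            totals[axis][key] += 1
    return {
        axis: {key: hits[axis][key] / totals[axis][key] for key in totals[axis]}
        for axis in totals
    }

# Toy usage with two items:
items = [
    BenchmarkItem("cardiovascular", "perception", "A", "A"),
    BenchmarkItem("gastroenterology", "disease analysis", "B", "C"),
]
print(spectrum_scores(items))
# {'specialty': {'cardiovascular': 1.0, 'gastroenterology': 0.0},
#  'capacity': {'perception': 1.0, 'disease analysis': 0.0}}
```

Reporting along both axes simultaneously is what distinguishes this style of evaluation from a single headline accuracy number: a model can look strong overall while failing an entire specialty or capacity.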
Related papers
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents (a minimal sketch of this kind of noise-injection probe follows this list).
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
We also present MedS-Ins, which comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- MedExQA: Medical Question Answering Benchmark with Multiple Explanations [2.2246416434538308]
This paper introduces MedExQA, a novel benchmark in medical question-answering to evaluate large language models' (LLMs) understanding of medical knowledge through explanations.
By constructing datasets across five distinct medical specialties, we address a major gap in current medical QA benchmarks.
Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology.
arXiv Detail & Related papers (2024-06-10T14:47:04Z)
- MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
Large language models (LLMs) have excelled across domains, delivering notable performance on medical evaluation benchmarks.
However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios.
We develop MultifacetEval, a novel evaluation framework to examine how thoroughly and how broadly LLMs encode and master medical knowledge.
arXiv Detail & Related papers (2024-06-05T04:15:07Z)
- COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain [1.6752458252726457]
Large Language Models (LLMs) constitute a breakthrough in state-of-the-art Artificial Intelligence (AI) technology.
We outline the Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD).
We propose a scoring framework of increasing difficulty to assess the ability of LLMs to interpret medical text.
arXiv Detail & Related papers (2024-05-17T16:31:56Z)
- Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z)
- MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of key components including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
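Several entries above, MedRGB in particular, probe how noise and misinformation in retrieved documents degrade answers. Below is a minimal sketch of that style of robustness probe, under stated assumptions: it is not the benchmark's actual harness, and `ask_llm`, `inject_noise`, and `robustness_gap` are hypothetical names introduced here for illustration.

```python
# A minimal sketch of a retrieval-noise robustness probe: inject distractor
# passages into the retrieved context, then compare answers against the
# clean-context run. `ask_llm` stands in for whatever QA model is under test.
import random

def inject_noise(retrieved: list[str], distractors: list[str],
                 k: int, seed: int = 0) -> list[str]:
    """Mix k distractor passages into the retrieved set at random positions."""
    rng = random.Random(seed)
    noisy = list(retrieved)
    for passage in rng.sample(distractors, k):
        noisy.insert(rng.randrange(len(noisy) + 1), passage)
    return noisy

def robustness_gap(question: str, retrieved: list[str], distractors: list[str],
                   ask_llm, k: int = 2) -> tuple[str, str]:
    """Return (clean_answer, noisy_answer) so a harness can score the drop."""
    clean = ask_llm(question, retrieved)
    noisy = ask_llm(question, inject_noise(retrieved, distractors, k))
    return clean, noisy

# Toy usage with a dummy model that just echoes the first passage:
clean, noisy = robustness_gap(
    "What is the first-line treatment?",
    retrieved=["passage about treatment A"],
    distractors=["irrelevant passage 1", "misleading passage 2"],
    ask_llm=lambda q, ctx: ctx[0],
)
print(clean, "|", noisy)
```

Scoring both runs against the gold answer and reporting the accuracy drop between them is one simple way to quantify the "limited ability to handle noise" that the MedRGB summary describes.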