MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation
- URL: http://arxiv.org/abs/2310.14088v3
- Date: Tue, 14 Nov 2023 21:59:56 GMT
- Title: MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation
- Authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili,
Julian McAuley, Chun-Nan Hsu
- Abstract summary: MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data.
- Score: 22.986061896641083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Curated datasets for healthcare are often limited due to the need of human
annotations from experts. In this paper, we present MedEval, a multi-level,
multi-task, and multi-domain medical benchmark to facilitate the development of
language models for healthcare. MedEval is comprehensive and consists of data
from several healthcare systems and spans 35 human body regions from 8
examination modalities. With 22,779 collected sentences and 21,228 reports, we
provide expert annotations at multiple levels, offering a granular potential
usage of the data and supporting a wide range of tasks. Moreover, we
systematically evaluated 10 generic and domain-specific language models under
zero-shot and finetuning settings, from domain-adapted baselines in healthcare
to general-purposed state-of-the-art large language models (e.g., ChatGPT). Our
evaluations reveal varying effectiveness of the two categories of language
models across different tasks, from which we notice the importance of
instruction tuning for few-shot usage of large language models. Our
investigation paves the way toward benchmarking language models for healthcare
and provides valuable insights into the strengths and limitations of adopting
large language models in medical domains, informing their practical
applications and future advancements.
Related papers
- A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - Towards Holistic Disease Risk Prediction using Small Language Models [2.137491464843808]
We introduce a framework that connects small language models to multiple data sources, aiming to predict the risk of various diseases simultaneously.
Our experiments encompass 12 different tasks within a multitask learning setup.
arXiv Detail & Related papers (2024-08-13T15:01:33Z) - Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We have conducted a comprehensive survey of language models in the medical field.
We evaluated a subset of these for medical text classification and conditional text generation.
The results reveal remarkable performance across the tasks and evaluated, underscoring the potential of certain models to contain medical knowledge.
arXiv Detail & Related papers (2024-06-24T12:52:02Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z) - DrBenchmark: A Large Language Understanding Evaluation Benchmark for
French Biomedical Domain [8.246368441549967]
We present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark.
It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification.
We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English specifics to assess their cross-lingual capabilities.
arXiv Detail & Related papers (2024-02-20T23:54:02Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - Localising In-Domain Adaptation of Transformer-Based Biomedical Language
Models [0.987336898133886]
We present two approaches to derive biomedical language models in languages other than English.
One is based on neural machine translation of English resources, favoring quantity over quality.
The other is based on a high-grade, narrow-scoped corpus written in Italian, thus preferring quality over quantity.
arXiv Detail & Related papers (2022-12-20T16:59:56Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware
Medical Dialogue Generation [86.38736781043109]
We build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG.
We propose two kinds of medical dialogue tasks based on MedDG dataset. One is the next entity prediction and the other is the doctor response generation.
Experimental results show that the pre-train language models and other baselines struggle on both tasks with poor performance in our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.