MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation
- URL: http://arxiv.org/abs/2310.14088v3
- Date: Tue, 14 Nov 2023 21:59:56 GMT
- Title: MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation
- Authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili,
Julian McAuley, Chun-Nan Hsu
- Abstract summary: MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data.
- Score: 22.986061896641083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Curated datasets for healthcare are often limited due to the need of human
annotations from experts. In this paper, we present MedEval, a multi-level,
multi-task, and multi-domain medical benchmark to facilitate the development of
language models for healthcare. MedEval is comprehensive and consists of data
from several healthcare systems and spans 35 human body regions from 8
examination modalities. With 22,779 collected sentences and 21,228 reports, we
provide expert annotations at multiple levels, offering a granular potential
usage of the data and supporting a wide range of tasks. Moreover, we
systematically evaluated 10 generic and domain-specific language models under
zero-shot and finetuning settings, from domain-adapted baselines in healthcare
to general-purposed state-of-the-art large language models (e.g., ChatGPT). Our
evaluations reveal varying effectiveness of the two categories of language
models across different tasks, from which we notice the importance of
instruction tuning for few-shot usage of large language models. Our
investigation paves the way toward benchmarking language models for healthcare
and provides valuable insights into the strengths and limitations of adopting
large language models in medical domains, informing their practical
applications and future advancements.
Related papers
- Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We provide a comprehensive survey of language models in the medical domain.
Our subset encompasses 53 models, ranging from 110 million to 13 billion parameters.
Our findings reveal remarkable performance across various tasks and datasets.
arXiv Detail & Related papers (2024-06-24T12:52:02Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z) - DrBenchmark: A Large Language Understanding Evaluation Benchmark for
French Biomedical Domain [8.246368441549967]
We present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark.
It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification.
We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English specifics to assess their cross-lingual capabilities.
arXiv Detail & Related papers (2024-02-20T23:54:02Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - Localising In-Domain Adaptation of Transformer-Based Biomedical Language
Models [0.987336898133886]
We present two approaches to derive biomedical language models in languages other than English.
One is based on neural machine translation of English resources, favoring quantity over quality.
The other is based on a high-grade, narrow-scoped corpus written in Italian, thus preferring quality over quantity.
arXiv Detail & Related papers (2022-12-20T16:59:56Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware
Medical Dialogue Generation [86.38736781043109]
We build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG.
We propose two kinds of medical dialogue tasks based on MedDG dataset. One is the next entity prediction and the other is the doctor response generation.
Experimental results show that the pre-train language models and other baselines struggle on both tasks with poor performance in our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.