DrBenchmark: A Large Language Understanding Evaluation Benchmark for
French Biomedical Domain
- URL: http://arxiv.org/abs/2402.13432v1
- Date: Tue, 20 Feb 2024 23:54:02 GMT
- Title: DrBenchmark: A Large Language Understanding Evaluation Benchmark for
French Biomedical Domain
- Authors: Yanis Labrak, Adrien Bazoge, Oumaima El Khettari, Mickael Rouvier,
Pacome Constant dit Beaufils, Natalia Grabar, Beatrice Daille, Solen Quiniou,
Emmanuel Morin, Pierre-Antoine Gourraud, Richard Dufour
- Abstract summary: We present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark.
It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification.
We evaluate 8 state-of-the-art masked language models (MLMs) pre-trained on general and biomedical-specific data, as well as English-specific MLMs, to assess their cross-lingual capabilities.
- Score: 8.246368441549967
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The biomedical domain has sparked a significant interest in the field of
Natural Language Processing (NLP), which has seen substantial advancements with
pre-trained language models (PLMs). However, comparing these models has proven
challenging due to variations in evaluation protocols across different models.
A fair solution is to aggregate diverse downstream tasks into a benchmark,
allowing the intrinsic qualities of PLMs to be assessed from various
perspectives. Although still limited to a few languages, this initiative has been
undertaken in the biomedical field, notably for English and Chinese. This
limitation hampers the evaluation of the latest French biomedical models, as
they are either assessed on a minimal number of tasks with non-standardized
protocols or evaluated using general downstream tasks. To bridge this research
gap and account for the unique sensitivities of French, we present the
first-ever publicly available French biomedical language understanding
benchmark called DrBenchmark. It encompasses 20 diversified tasks, including
named-entity recognition, part-of-speech tagging, question-answering, semantic
textual similarity, and classification. We evaluate 8 state-of-the-art
masked language models (MLMs) pre-trained on general and biomedical-specific
data, as well as English-specific MLMs, to assess their cross-lingual
capabilities. Our experiments reveal that no single model excels across all
tasks, while generalist models sometimes remain competitive.
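As a rough illustration of the evaluation protocol described in the abstract, the sketch below fine-tunes one of the evaluated French MLMs on a single token-classification (NER) task with Hugging Face transformers and datasets. The dataset identifier ("DrBenchmark/QUAERO"), split names, and column names ("tokens", "ner_tags") are assumptions for illustration only; the actual DrBenchmark task loaders, hyperparameters, and formats may differ.

```python
# Minimal sketch of one DrBenchmark-style evaluation: fine-tune a pre-trained
# French MLM on a biomedical NER task. Dataset id, splits, and column names
# below are assumptions, not the paper's exact setup.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "camembert-base"        # one general-domain French MLM
DATASET_ID = "DrBenchmark/QUAERO"    # hypothetical Hub identifier

dataset = load_dataset(DATASET_ID)   # assumed columns: "tokens", "ner_tags"
label_names = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(label_names)
)

def tokenize_and_align(batch):
    """Tokenize pre-split words and align word-level NER tags to sub-tokens."""
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        prev, ids = None, []
        for wid in word_ids:
            if wid is None or wid == prev:
                ids.append(-100)     # ignore special tokens and sub-token repeats
            else:
                ids.append(tags[wid])
            prev = wid
        labels.append(ids)
    enc["labels"] = labels
    return enc

tokenized = dataset.map(
    tokenize_and_align, batched=True,
    remove_columns=dataset["train"].column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="drbenchmark-ner",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],   # assumed split name
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
print(trainer.evaluate())
```

The same recipe would extend to the other task families (POS tagging, classification, semantic textual similarity, question answering) by swapping the task head and the evaluation metric.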
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the mass of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - A Benchmark Evaluation of Clinical Named Entity Recognition in French [4.430193084761607]
We evaluate biomedical models CamemBERT-bio and DrBERT and compare them to standard French models CamemBERT, FlauBERT and FrALBERT.
Results suggest that CamemBERT-bio consistently outperforms DrBERT, while FlauBERT offers competitive performance and FrALBERT achieves the lowest carbon footprint.
arXiv Detail & Related papers (2024-03-28T07:59:58Z) - BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains [8.448541067852]
Large Language Models (LLMs) have demonstrated remarkable versatility in recent years.
Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges.
We introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model.
arXiv Detail & Related papers (2024-02-15T23:39:04Z) - PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain [24.411904114158673]
We re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, PromptCBLUE.
Our benchmark is a suitable test bed and an online platform for evaluating the multi-task capabilities of Chinese LLMs on a wide range of biomedical tasks.
arXiv Detail & Related papers (2023-10-22T02:20:38Z) - MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation [22.986061896641083]
MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering granular potential usages of the data.
arXiv Detail & Related papers (2023-10-21T18:59:41Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the results show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z) - XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating
Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilingual Encoders, is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)