DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
- URL: http://arxiv.org/abs/2304.00958v2
- Date: Thu, 4 May 2023 19:59:38 GMT
- Title: DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
- Authors: Yanis Labrak and Adrien Bazoge and Richard Dufour and Mickael Rouvier and Emmanuel Morin and Béatrice Daille and Pierre-Antoine Gourraud
- Abstract summary: We propose an original study of PLMs in the medical domain for the French language.
We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments.
We show that we can take advantage of existing biomedical PLMs in a foreign language by further pre-training them on our targeted data.
- Score: 4.989459243399296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, pre-trained language models (PLMs) have achieved the best
performance on a wide range of natural language processing (NLP) tasks. While
the first models were trained on general-domain data, specialized ones have
emerged to handle specific domains more effectively. In this paper, we propose
an original study of PLMs in the medical domain for the French language. We compare,
for the first time, the performance of PLMs trained on both public data from
the web and private data from healthcare establishments. We also evaluate
different learning strategies on a set of biomedical tasks. In particular, we
show that we can take advantage of existing biomedical PLMs in a
foreign language by further pre-training them on our targeted data. Finally, we
release the first specialized PLMs for the biomedical field in French, called
DrBERT, as well as the largest freely licensed corpus of medical data on
which these models are trained.
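The further-pre-training strategy described in the abstract relies on the masked-language-modelling (MLM) objective used by BERT-like models. A minimal sketch of BERT-style dynamic masking is shown below; the function name and the `-100` ignore-index are illustrative assumptions, while the 80/10/10 replacement split follows the original BERT recipe:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style dynamic masking (illustrative sketch).

    Each token is selected for prediction with probability mlm_prob;
    of the selected tokens, 80% become [MASK], 10% become a random
    token, and 10% are left unchanged. Unselected positions get a
    label of -100 so a cross-entropy loss can ignore them.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 = position ignored by the loss
    for i in range(len(inputs)):
        if rng.random() < mlm_prob:
            labels[i] = inputs[i]          # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

In practice this masking is applied on the fly to batches of domain text (here, the French medical corpus), and either a fresh or an already pre-trained checkpoint is optimized against the resulting labels.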
Related papers
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continual pre-training can match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z) - Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z) - BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains [8.448541067852]
Large Language Models (LLMs) have demonstrated remarkable versatility in recent years.
Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges.
We introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model.
arXiv Detail & Related papers (2024-02-15T23:39:04Z) - HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [61.41790586411816]
HuatuoGPT-II has shown state-of-the-art performance in Chinese medicine domain on a number of benchmarks.
It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine.
arXiv Detail & Related papers (2023-11-16T10:56:24Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data [1.1265248232450553]
Transfer learning with BERT-like models has allowed major advances for French, especially for named entity recognition.
We introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset.
Through continual pre-training, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks.
arXiv Detail & Related papers (2023-06-27T15:23:14Z) - LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the results show that LERT brings significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z) - Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.