Learning structures of the French clinical language: development and
validation of word embedding models using 21 million clinical reports from
electronic health records
- URL: http://arxiv.org/abs/2207.12940v1
- Date: Tue, 26 Jul 2022 14:46:34 GMT
- Authors: Basile Dura, Charline Jean, Xavier Tannier, Alice Calliger, Romain
Bey, Antoine Neuraz, Rémi Flicoteaux
- Abstract summary: Methods based on transfer learning using pre-trained language models have achieved state-of-the-art results in most NLP applications.
We aimed to evaluate the impact of adapting a language model to French clinical reports on downstream medical NLP tasks.
- Score: 2.5709272341038027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background
Clinical studies using real-world data may benefit from exploiting clinical
reports, a particularly rich albeit unstructured medium. To that end, natural
language processing can extract relevant information. Methods based on transfer
learning using pre-trained language models have achieved state-of-the-art
results in most NLP applications; however, publicly available models lack
exposure to speciality languages, especially in the medical field.
Objective
We aimed to evaluate the impact of adapting a language model to French
clinical reports on downstream medical NLP tasks.
Methods
We leveraged a corpus of 21M clinical reports collected from August 2017 to
July 2021 at the Greater Paris University Hospitals (APHP) to train two
CamemBERT models on this speciality language: one pretrained from scratch and
the other initialised from the original CamemBERT. We used two French annotated
medical datasets to compare our language models to the original CamemBERT
network, evaluating the statistical significance of improvement with the
Wilcoxon test.
Results
Our models pretrained on clinical reports increased the average F1-score on
APMed (an APHP-specific task) by 3 percentage points to 91%, a statistically
significant improvement. They also achieved performance comparable to the
original CamemBERT on QUAERO. These results hold true for the fine-tuned and
from-scratch versions alike, starting from very few pre-training samples.
Conclusions
We confirm previous literature showing that adapting generalist pre-trained
language models such as CamemBERT to speciality corpora improves their
performance for downstream clinical NLP tasks. Our results suggest that
retraining from scratch does not induce a statistically significant performance
gain compared to fine-tuning.
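The Methods describe testing significance of paired per-run F1-score differences with the Wilcoxon test. A minimal pure-Python sketch of the Wilcoxon signed-rank statistic is shown below; the per-fold F1 values are hypothetical, purely illustrative numbers, not results from the paper:

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples."""
    # Paired differences; zero differences (exact ties) are dropped.
    diffs = [b - a for a, b in zip(x, y) if b != a]
    # Rank absolute differences, averaging ranks over tied magnitudes.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical per-fold F1-scores: baseline CamemBERT vs. a clinically adapted model.
baseline = [0.87, 0.88, 0.885, 0.875, 0.88]
adapted = [0.905, 0.91, 0.915, 0.90, 0.912]
print(wilcoxon_signed_rank(baseline, adapted))  # 0.0: the adapted model wins every fold
```

A small statistic (here 0, since the adapted model is better on every fold) corresponds to a low p-value. In practice one would call `scipy.stats.wilcoxon`, which computes the same statistic and also returns the p-value.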
Related papers
- DAEDRA: A language model for predicting outcomes in passive
pharmacovigilance reporting [0.0]
DAEDRA is a large language model designed to detect regulatory-relevant outcomes in adverse event reports.
This paper details the conception, design, training and evaluation of DAEDRA.
arXiv Detail & Related papers (2024-02-10T16:48:45Z) - Do We Still Need Clinical Language Models? [15.023633270864675]
We show that relatively small specialized clinical models substantially outperform all in-context learning approaches.
We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
arXiv Detail & Related papers (2023-02-16T05:08:34Z) - Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z) - Exploring the Value of Pre-trained Language Models for Clinical Named
Entity Recognition [6.917786124918387]
We compare Transformer models that are trained from scratch to fine-tuned BERT-based LLMs.
We examine the impact of an additional CRF layer on such models to encourage contextual learning.
arXiv Detail & Related papers (2022-10-23T16:27:31Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem for the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z) - Med7: a transferable clinical natural language processing model for
electronic health records [6.935142529928062]
We introduce a named-entity recognition model for clinical natural language processing.
The model is trained to recognise seven categories: drug names, route, frequency, dosage, strength, form, duration.
We evaluate the transferability of the developed model using the data from the Intensive Care Unit in the US to secondary care mental health records (CRIS) in the UK.
arXiv Detail & Related papers (2020-03-03T00:55:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.