Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario
- URL: http://arxiv.org/abs/2109.03570v1
- Date: Wed, 8 Sep 2021 12:12:07 GMT
- Title: Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario
- Authors: Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño, Joan Llop-Palao, Marc Pàmies, Aitor Gonzalez-Agirre, Marta Villegas
- Abstract summary: This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
- Score: 0.05277024349608833
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This work presents biomedical and clinical language models for Spanish by
experimenting with different pretraining choices, such as masking at word and
subword level, varying the vocabulary size and testing with domain data,
looking for better language representations. Interestingly, in the absence of
enough clinical data to train a model from scratch, we applied mixed-domain
pretraining and cross-domain transfer approaches to generate a performant
bio-clinical model suitable for real-world clinical data. We evaluated our
models on Named Entity Recognition (NER) tasks for biomedical documents and
challenging hospital discharge reports. When compared against the competitive
mBERT and BETO models, we outperform them in all NER tasks by a significant
margin. Finally, we studied the impact of the model's vocabulary on the NER
performances by offering an interesting vocabulary-centric analysis. The
results confirm that domain-specific pretraining is fundamental to achieving
higher performances in downstream NER tasks, even within a mid-resource
scenario. To the best of our knowledge, we provide the first biomedical and
clinical transformer-based pretrained language models for Spanish, intending to
boost native Spanish NLP applications in biomedicine. Our models will be made
freely available after publication.
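To make the pretraining choices above concrete, here is a minimal, hypothetical sketch using the Hugging Face tokenizers/transformers libraries (not the authors' actual training code): the vocabulary size is fixed when the domain WordPiece tokenizer is trained, while word- versus subword-level masking is selected through the masked-language-modeling data collator. The corpus file, vocabulary size, and checkpoint name are placeholders.

```python
# Minimal sketch of the two pretraining choices discussed above (masking
# granularity and vocabulary size); all names and files are placeholders.
from tokenizers import BertWordPieceTokenizer
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,  # masks individual subword tokens
    DataCollatorForWholeWordMask,     # masks all subwords of a word together
)

# 1) Vocabulary size is set when the domain tokenizer is trained.
with open("biomedical_corpus_es.txt", "w", encoding="utf-8") as f:
    f.write("El paciente presenta hipertransaminasemia y cefalea.\n")  # toy corpus
wordpiece = BertWordPieceTokenizer(lowercase=False)
wordpiece.train(files=["biomedical_corpus_es.txt"], vocab_size=52000)
wordpiece.save_model(".")  # writes vocab.txt for later MLM pretraining

# 2) Masking granularity is chosen through the MLM data collator.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
subword_masking = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
whole_word_masking = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15
)

# A long biomedical term splits into several subwords: subword masking may hide
# only some of the pieces, whole-word masking hides all of them at once.
print(tokenizer.tokenize("hipertransaminasemia"))
```

Either collator can then be passed as the data_collator of a Trainer when pretraining on the biomedical and clinical corpora.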
Related papers
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training can match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z)
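As a rough illustration of the continuous (domain-adaptive) pretraining used in the entry above, the sketch below continues masked-language-model training from a general-domain German checkpoint on a toy in-domain corpus; the checkpoint name and the two example sentences are placeholders, not the study's data or models.

```python
# Rough sketch of continued (domain-adaptive) MLM pretraining with the
# Hugging Face Trainer API; checkpoint and corpus are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bert-base-german-cased"  # assumed general-domain starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Stand-in for billions of tokens of translated and clinical German text.
corpus = Dataset.from_dict({"text": [
    "Der Patient klagt über Fieber und Kopfschmerzen.",
    "Die Laborwerte zeigen eine erhöhte Leukozytenzahl.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="german-med-mlm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # continues pretraining on the in-domain corpus
model.save_pretrained("german-med-mlm")
```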
- Neural Machine Translation of Clinical Text: An Empirical Investigation into Multilingual Pre-Trained Language Models and Transfer-Learning [6.822926897514793]
Experimental results are reported on three subtasks: 1) clinical case (CC), 2) clinical terminology (CT), and 3) ontological concept (OC).
Our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical-domain data.
The transfer-learning method works well in our experimental setting, using the WMT21fb model to accommodate the new language space, Spanish.
arXiv Detail & Related papers (2023-12-12T13:26:42Z)
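The transfer-learning recipe in the entry above, i.e. adapting a pretrained translation model to the clinical English-Spanish domain, can be approximated with a generic Hugging Face seq2seq fine-tuning loop; the Marian checkpoint and the single toy sentence pair below are stand-ins for the WMT21fb model and the ClinSpEn-2022 data.

```python
# Hypothetical seq2seq fine-tuning sketch for clinical machine translation;
# the checkpoint and the toy parallel pair are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-es"  # stand-in for a large pretrained MT model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Stand-in for the clinical-domain parallel corpus.
pairs = Dataset.from_dict({
    "en": ["The patient presents with fever and headache."],
    "es": ["El paciente presenta fiebre y cefalea."],
})

def preprocess(batch):
    enc = tokenizer(batch["en"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["es"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

tokenized = pairs.map(preprocess, batched=True, remove_columns=["en", "es"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="clinical-mt-demo",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # adapts the general MT model to the clinical domain
```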
- UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition [4.865221751784403]
This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS.
Preliminary results from experiments in the extension of pre-trained LMs as well as training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
arXiv Detail & Related papers (2023-07-20T18:08:34Z)
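Biomedical NER, the downstream task used in the entry above (and in the main paper's evaluation), is usually framed as token classification; the following is a generic fine-tuning sketch with an assumed IOB tag set, checkpoint, and toy example rather than either paper's actual setup.

```python
# Generic token-classification (NER) fine-tuning sketch with Hugging Face;
# the checkpoint, labels, and the single toy sentence are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-DISEASE", "I-DISEASE"]      # assumed tag set
checkpoint = "bert-base-multilingual-cased"   # stand-in biomedical encoder

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)

# Toy word-level annotations in IOB format.
data = Dataset.from_dict({
    "tokens": [["Paciente", "con", "diabetes", "mellitus", "."]],
    "ner_tags": [[0, 0, 1, 2, 0]],
})

def align_labels(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    word_ids, prev, aligned = enc.word_ids(), None, []
    for w in word_ids:
        # Special tokens and non-first subwords get -100 (ignored by the loss).
        aligned.append(example["ner_tags"][w] if w is not None and w != prev else -100)
        prev = w
    enc["labels"] = aligned
    return enc

tokenized = data.map(align_labels, remove_columns=["tokens", "ner_tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```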
- Developing a general-purpose clinical language inference model from a large corpus of clinical notes [0.30586855806896046]
We trained a Bidirectional Encoder Representations from Transformers (BERT) model using a diverse corpus of 75 million deidentified clinical notes authored at the University of California, San Francisco (UCSF).
Our model performs at par with the best publicly available biomedical language models of comparable sizes on the public benchmark tasks, and is significantly better than these models in a within-system evaluation on the two tasks using UCSF data.
arXiv Detail & Related papers (2022-10-12T20:08:45Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem to the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
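The diagnosis-prediction task in the last entry treats each EHR note as a sequence-classification input; a minimal, hypothetical sketch of that setup (with an assumed checkpoint and label set rather than the Russian EHR data) looks like this:

```python
# Generic BERT sequence-classification sketch for note-level prediction;
# the checkpoint, label set, and toy note are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"               # stand-in encoder
labels = ["no_diagnosis", "diagnosis_A", "diagnosis_B"]   # assumed label space

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)
model.eval()

notes = ["Patient admitted with chest pain and shortness of breath."]
batch = tokenizer(notes, padding=True, truncation=True, max_length=256,
                  return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits          # shape: (batch_size, num_labels)
predictions = logits.argmax(dim=-1)
print([labels[i] for i in predictions])     # untrained head: arbitrary output until fine-tuned
```

In practice the classification head would be fine-tuned on labeled visits before the predictions are meaningful.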