EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural
Language Processing
- URL: http://arxiv.org/abs/2306.07373v1
- Date: Mon, 12 Jun 2023 18:56:25 GMT
- Title: EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural
Language Processing
- Authors: Iker de la Iglesia and Aitziber Atutxa and Koldo Gojenola and Ander
Barrena
- Abstract summary: We introduce EriBERTa, a bilingual domain-specific language model pre-trained on extensive medical and clinical corpora.
We demonstrate that EriBERTa outperforms previous Spanish language models in the clinical domain.
- Score: 2.370481325034443
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The utilization of clinical reports for various secondary purposes, including
health research and treatment monitoring, is crucial for enhancing patient
care. Natural Language Processing (NLP) tools have emerged as valuable assets
for extracting and processing relevant information from these reports. However,
the availability of specialized language models for the clinical domain in
Spanish has been limited.
In this paper, we introduce EriBERTa, a bilingual domain-specific language
model pre-trained on extensive medical and clinical corpora. We demonstrate
that EriBERTa outperforms previous Spanish language models in the clinical
domain, showcasing its superior capabilities in understanding medical texts and
extracting meaningful information. Moreover, EriBERTa exhibits promising
transfer learning abilities, allowing for knowledge transfer from one language
to another. This aspect is particularly beneficial given the scarcity of
Spanish clinical data.
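  The abstract describes a standard encoder-style pre-trained model, so the typical downstream use would be fine-tuning it for a clinical task such as named entity recognition. The sketch below shows one minimal way this could look with the Hugging Face transformers library; the checkpoint identifier, label set, and dataset variables are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: fine-tuning a bilingual clinical encoder for token
# classification (e.g. clinical NER) with Hugging Face transformers.
# The checkpoint name, label set, and dataset variables are assumptions
# made for illustration, not details from the paper.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "HiTZ/EriBERTa-base"         # assumed Hub identifier
LABELS = ["O", "B-DISEASE", "I-DISEASE"]  # illustrative clinical NER tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

def tokenize_and_align(example):
    """Tokenize pre-split words and align word-level tags to subword tokens."""
    enc = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    # Special tokens get label -100 so the loss ignores them.
    enc["labels"] = [
        -100 if word_id is None else example["ner_tags"][word_id]
        for word_id in enc.word_ids()
    ]
    return enc

# `train_ds` / `eval_ds` stand in for a clinical NER dataset, e.g. a Spanish
# training split, or an English training split paired with a Spanish evaluation
# split to probe the cross-lingual transfer described in the abstract.
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="eriberta-clinical-ner", num_train_epochs=3),
#     train_dataset=train_ds.map(tokenize_and_align),
#     eval_dataset=eval_ds.map(tokenize_and_align),
#     data_collator=DataCollatorForTokenClassification(tokenizer),
# )
# trainer.train()
```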
Related papers
- Extracting and Transferring Abilities For Building Multi-lingual Ability-enhanced Large Language Models [104.96990850774566]
We propose a Multi-lingual Ability Extraction and Transfer approach, named MAET.
Our key idea is to decompose and extract language-agnostic ability-related weights from large language models.
Experimental results show that MAET can effectively and efficiently extract and transfer advanced abilities, outperforming training-based baseline methods.
arXiv Detail & Related papers (2024-10-10T11:23:18Z) - Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training can match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z) - Fine-Tuned Large Language Models for Symptom Recognition from Spanish
Clinical Text [6.918493795610175]
This study addresses a shared task on the detection of symptoms, signs, and findings in Spanish medical documents.
We combine a set of large language models fine-tuned on the data released by the organizers.
arXiv Detail & Related papers (2024-01-28T22:11:25Z) - Neural Machine Translation of Clinical Text: An Empirical Investigation
into Multilingual Pre-Trained Language Models and Transfer-Learning [6.822926897514793]
We report experimental results on three subtasks: 1) clinical cases (CC), 2) clinical terminology (CT), and 3) ontological concepts (OC).
Our models achieved top-level performance in the ClinSpEn-2022 shared task on English-Spanish clinical domain data.
The transfer-learning method works well in our experimental setting, using the WMT21fb model to accommodate a new language, Spanish.
arXiv Detail & Related papers (2023-12-12T13:26:42Z) - Multilingual Clinical NER: Translation or Cross-lingual Transfer? [4.4924444466378555]
We show that translation-based methods can achieve similar performance to cross-lingual transfer.
We release MedNERF, a medical NER test set extracted from French drug prescriptions and annotated with the same guidelines as an English dataset.
arXiv Detail & Related papers (2023-06-07T12:31:07Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the results show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Learning Domain-Specialised Representations for Cross-Lingual Biomedical
Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z) - UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual
Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
With this approach, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z) - A Study of Cross-Lingual Ability and Language-specific Information in
Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improving the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z) - Pre-training via Leveraging Assisting Languages and Data Selection for
Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to compensate for the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.