Neural Machine Translation of Clinical Text: An Empirical Investigation
into Multilingual Pre-Trained Language Models and Transfer-Learning
- URL: http://arxiv.org/abs/2312.07250v2
- Date: Wed, 21 Feb 2024 12:08:47 GMT
- Title: Neural Machine Translation of Clinical Text: An Empirical Investigation
into Multilingual Pre-Trained Language Models and Transfer-Learning
- Authors: Lifeng Han, Serge Gladkoff, Gleb Erofeev, Irina Sorokina, Betty
Galiano, Goran Nenadic
- Abstract summary: Experiments cover three subtasks: 1) clinical case (CC), 2) clinical terminology (CT), and 3) ontological concept (OC).
Our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical domain data.
The transfer learning method works well in our experimental setting, using the WMT21fb model to accommodate Spanish as a new language space.
- Score: 6.822926897514793
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We investigate clinical text machine translation by examining
multilingual neural network models based on deep learning, such as
Transformer-based architectures. Furthermore, to address the language-resource
imbalance issue, we also carry out experiments using a transfer learning
methodology based on massive multilingual pre-trained language models
(MMPLMs). The experimental results on three subtasks, 1) clinical case (CC),
2) clinical terminology (CT), and 3) ontological concept (OC), show that our
models achieved top-level performances in the ClinSpEn-2022 shared task on
English-Spanish clinical domain data. Furthermore, our expert-based human
evaluations demonstrate that the small-sized pre-trained language model (PLM)
outperformed the other two extra-large language models by a large margin in
clinical-domain fine-tuning, a finding that had not previously been reported
in the field. Finally, the transfer learning method works well in our
experimental setting, using the WMT21fb model to accommodate a new language
space, Spanish, that was not seen at the pre-training stage of WMT21fb itself;
this deserves further exploration for clinical knowledge transformation, e.g.
investigating more languages. These research findings can shed some light on
domain-specific machine translation development, especially in the clinical
and healthcare fields.
Further research projects can be carried out based on our work to improve
healthcare text analytics and knowledge transformation. Our data will be openly
available for research purposes at https://github.com/HECTA-UoM/ClinicalNMT
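To illustrate the kind of clinical-domain fine-tuning described in the abstract, here is a minimal sketch, assuming the Hugging Face transformers and datasets libraries, of adapting a small pre-trained English-Spanish MT model to clinical parallel data. This is not the authors' pipeline: the checkpoint (Helsinki-NLP/opus-mt-en-es), the file clinical_en_es.jsonl, and the hyperparameters are placeholder assumptions.

```python
# Sketch only: fine-tune a small pre-trained EN-ES MT model on clinical
# parallel data. Checkpoint, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-es"   # stand-in small PLM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical parallel clinical corpus in JSON lines: {"en": ..., "es": ...}
raw = load_dataset("json", data_files={"train": "clinical_en_es.jsonl"})

def preprocess(batch):
    # Tokenize English source and Spanish target sentences.
    inputs = tokenizer(batch["en"], truncation=True, max_length=256)
    targets = tokenizer(text_target=batch["es"], truncation=True, max_length=256)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=["en", "es"])

args = Seq2SeqTrainingArguments(
    output_dir="clinical-nmt-en-es",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In principle the same loop applies to larger MMPLMs such as WMT21fb by swapping the checkpoint name, although accommodating a language unseen at pre-training may require additional tokenizer and embedding adjustments.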
Related papers
- UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for
Biomedical Entity Recognition [4.865221751784403]
This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS.
Preliminary results from experiments on both extending pre-trained LMs and training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
arXiv Detail & Related papers (2023-07-20T18:08:34Z) - EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural
Language Processing [2.370481325034443]
We introduce EriBERTa, a bilingual domain-specific language model pre-trained on extensive medical and clinical corpora.
We demonstrate that EriBERTa outperforms previous Spanish language models in the clinical domain.
arXiv Detail & Related papers (2023-06-12T18:56:25Z) - Developing a general-purpose clinical language inference model from a
large corpus of clinical notes [0.30586855806896046]
We trained a Bidirectional Encoder Representations from Transformers (BERT) model using a diverse, deidentified corpus of 75 million clinical notes authored at the University of California, San Francisco (UCSF).
Our model performs on par with the best publicly available biomedical language models of comparable size on public benchmark tasks, and is significantly better than these models in a within-system evaluation on the two tasks using UCSF data.
arXiv Detail & Related papers (2022-10-12T20:08:45Z) - Investigating Massive Multilingual Pre-Trained Machine Translation
Models for Clinical Domain via Transfer Learning [11.571189144910521]
Massively multilingual pre-trained language models (MMPLMs) have been developed in recent years, demonstrating strong capabilities and the prior knowledge they acquire for downstream tasks.
This work investigates whether MMPLMs can be applied to clinical domain machine translation (MT) for entirely unseen languages via transfer learning.
arXiv Detail & Related papers (2022-10-12T10:19:44Z) - High-resource Language-specific Training for Multilingual Neural Machine
Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work presents, to our knowledge, the first systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that a character-level BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to the low-resource, real-world challenge of de-identifying code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using a multilingual dense retrieval design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the results show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Improving the Lexical Ability of Pretrained Language Models for
Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that performance suffers because these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)