Localising In-Domain Adaptation of Transformer-Based Biomedical Language
Models
- URL: http://arxiv.org/abs/2212.10422v3
- Date: Wed, 28 Jun 2023 08:36:20 GMT
- Title: Localising In-Domain Adaptation of Transformer-Based Biomedical Language
Models
- Authors: Tommaso Mario Buonocore, Claudio Crema, Alberto Redolfi, Riccardo
Bellazzi, Enea Parimbelli
- Abstract summary: We present two approaches to derive biomedical language models in languages other than English.
One is based on neural machine translation of English resources, favoring quantity over quality.
The other is based on a high-grade, narrow-scoped corpus written in Italian, thus preferring quality over quantity.
- Score: 0.987336898133886
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the era of digital healthcare, the huge volumes of textual information
generated every day in hospitals constitute an essential but underused asset
that could be exploited with task-specific, fine-tuned biomedical language
representation models, improving patient care and management. For such
specialized domains, previous research has shown that fine-tuning models
stemming from broad-coverage checkpoints can largely benefit additional
training rounds over large-scale in-domain resources. However, these resources
are often unreachable for less-resourced languages like Italian, preventing
local medical institutions to employ in-domain adaptation. In order to reduce
this gap, our work investigates two accessible approaches to derive biomedical
language models in languages other than English, taking Italian as a concrete
use-case: one based on neural machine translation of English resources,
favoring quantity over quality; the other based on a high-grade, narrow-scoped
corpus natively written in Italian, thus preferring quality over quantity. Our
study shows that data quantity is a harder constraint than data quality for
biomedical adaptation, but the concatenation of high-quality data can improve
model performance even when dealing with relatively size-limited corpora. The
models published from our investigations have the potential to unlock important
research opportunities for Italian hospitals and academia. Finally, the set of
lessons learned from the study constitutes valuable insights towards a solution
to build biomedical language models that are generalizable to other
less-resourced languages and different domain settings.
Related papers
- Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We provide a comprehensive survey of language models in the medical domain.
Our subset encompasses 53 models, ranging from 110 million to 13 billion parameters.
Our findings reveal remarkable performance across various tasks and datasets.
arXiv Detail & Related papers (2024-06-24T12:52:02Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - DAEDRA: A language model for predicting outcomes in passive
pharmacovigilance reporting [0.0]
DAEDRA is a large language model designed to detect regulatory-relevant outcomes in adverse event reports.
This paper details the conception, design, training and evaluation of DAEDRA.
arXiv Detail & Related papers (2024-02-10T16:48:45Z) - MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation [22.986061896641083]
MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data.
arXiv Detail & Related papers (2023-10-21T18:59:41Z) - Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - Learning Domain-Specialised Representations for Cross-Lingual Biomedical
Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL)
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.