Localising In-Domain Adaptation of Transformer-Based Biomedical Language
Models
- URL: http://arxiv.org/abs/2212.10422v3
- Date: Wed, 28 Jun 2023 08:36:20 GMT
- Title: Localising In-Domain Adaptation of Transformer-Based Biomedical Language
Models
- Authors: Tommaso Mario Buonocore, Claudio Crema, Alberto Redolfi, Riccardo
Bellazzi, Enea Parimbelli
- Abstract summary: We present two approaches to derive biomedical language models in languages other than English.
One is based on neural machine translation of English resources, favoring quantity over quality.
The other is based on a high-grade, narrow-scoped corpus written in Italian, thus preferring quality over quantity.
- Score: 0.987336898133886
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the era of digital healthcare, the huge volumes of textual information
generated every day in hospitals constitute an essential but underused asset
that could be exploited with task-specific, fine-tuned biomedical language
representation models, improving patient care and management. For such
specialized domains, previous research has shown that models fine-tuned from
broad-coverage checkpoints can benefit substantially from additional training
rounds over large-scale in-domain resources. However, these resources
are often unreachable for less-resourced languages like Italian, preventing
local medical institutions from employing in-domain adaptation. In order to reduce
this gap, our work investigates two accessible approaches to derive biomedical
language models in languages other than English, taking Italian as a concrete
use-case: one based on neural machine translation of English resources,
favoring quantity over quality; the other based on a high-grade, narrow-scoped
corpus natively written in Italian, thus preferring quality over quantity. Our
study shows that data quantity is a harder constraint than data quality for
biomedical adaptation, but the concatenation of high-quality data can improve
model performance even when dealing with relatively size-limited corpora. The
models published from our investigations have the potential to unlock important
research opportunities for Italian hospitals and academia. Finally, the set of
lessons learned from the study constitutes valuable insights towards a solution
to build biomedical language models that are generalizable to other
less-resourced languages and different domain settings.
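As a concrete illustration of the in-domain adaptation discussed above, the following is a minimal sketch of continued masked-language-model pretraining of a broad-coverage Italian checkpoint on an in-domain biomedical corpus, using the Hugging Face Transformers and Datasets libraries. The checkpoint name, corpus file, and hyperparameters are illustrative assumptions, not the authors' actual configuration; the in-domain corpus could be either machine-translated English text (quantity over quality) or a smaller natively Italian medical corpus (quality over quantity).

```python
# Minimal sketch of domain-adaptive (continued) pretraining via masked language
# modeling. Checkpoint, corpus path, and hyperparameters are assumptions for
# illustration only, not the setup reported in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from a broad-coverage Italian checkpoint (hypothetical choice).
checkpoint = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# In-domain corpus, one document per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "it_biomed_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the standard 15% masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-italian-biomed",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

After this continued-pretraining step, the adapted checkpoint would typically be fine-tuned on task-specific labelled data (e.g., clinical NER or text classification) to measure the benefit of in-domain adaptation.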
Related papers
- Towards Holistic Disease Risk Prediction using Small Language Models [2.137491464843808]
We introduce a framework that connects small language models to multiple data sources, aiming to predict the risk of various diseases simultaneously.
Our experiments encompass 12 different tasks within a multitask learning setup.
arXiv Detail & Related papers (2024-08-13T15:01:33Z) - Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z) - Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We have conducted a comprehensive survey of language models in the medical field.
We evaluated a subset of these for medical text classification and conditional text generation.
The results reveal remarkable performance across the evaluated tasks, underscoring the potential of certain models to encode medical knowledge.
arXiv Detail & Related papers (2024-06-24T12:52:02Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - DAEDRA: A language model for predicting outcomes in passive
pharmacovigilance reporting [0.0]
DAEDRA is a large language model designed to detect regulatory-relevant outcomes in adverse event reports.
This paper details the conception, design, training and evaluation of DAEDRA.
arXiv Detail & Related papers (2024-02-10T16:48:45Z) - MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark
for Language Model Evaluation [22.986061896641083]
MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data.
arXiv Detail & Related papers (2023-10-21T18:59:41Z) - Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - Learning Domain-Specialised Representations for Cross-Lingual Biomedical
Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.