Related papers: Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

URL: http://arxiv.org/abs/2212.10422v3
Date: Wed, 28 Jun 2023 08:36:20 GMT
Title: Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models
Authors: Tommaso Mario Buonocore, Claudio Crema, Alberto Redolfi, Riccardo Bellazzi, Enea Parimbelli
Abstract summary: We present two approaches to derive biomedical language models in languages other than English. One is based on neural machine translation of English resources, favoring quantity over quality. The other is based on a high-grade, narrow-scoped corpus written in Italian, thus preferring quality over quantity.
Score: 0.987336898133886
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use-case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards a solution to build biomedical language models that are generalizable to other less-resourced languages and different domain settings.

Related papers

Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality [1.6594309236462432]
This study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks. We focus on three languages: Dutch, Romanian and Spanish.
arXiv Detail & Related papers (2025-10-31T15:28:01Z)
Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation [23.003923723432436]
This study systematically evaluated the impact of quantization on 12 state-of-the-art large language models. We show that quantization substantially reduces GPU memory requirements, by up to 75%, while preserving model performance across diverse tasks.
arXiv Detail & Related papers (2025-09-04T04:18:45Z)
Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model [1.4843690728082002]
This study explores the enhancement of medical knowledge in a small language model by leveraging accessible online data. We fine-tuned a baseline model using our curated data to improve its medical knowledge. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering.
arXiv Detail & Related papers (2025-05-21T20:30:47Z)
Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages [1.3699492682906507]
Language-specific models substantially outperformed both general and domain-specific models in generating radiology reports. Models fine-tuned with medical terminology exhibited enhanced performance across all languages.
arXiv Detail & Related papers (2025-05-02T08:14:03Z)
Towards Holistic Disease Risk Prediction using Small Language Models [2.137491464843808]
We introduce a framework that connects small language models to multiple data sources, aiming to predict the risk of various diseases simultaneously. Our experiments encompass 12 different tasks within a multitask learning setup.
arXiv Detail & Related papers (2024-08-13T15:01:33Z)
Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts. Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models. The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We have conducted a comprehensive survey of language models in the medical field. We evaluated a subset of these for medical text classification and conditional text generation. The results reveal remarkable performance across the evaluated tasks, underscoring the potential of certain models to contain medical knowledge.
arXiv Detail & Related papers (2024-06-24T12:52:02Z)
Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
DAEDRA: A language model for predicting outcomes in passive pharmacovigilance reporting [0.0]
DAEDRA is a large language model designed to detect regulatory-relevant outcomes in adverse event reports. This paper details the conception, design, training and evaluation of DAEDRA.
arXiv Detail & Related papers (2024-02-10T16:48:45Z)
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation [22.986061896641083]
MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data.
arXiv Detail & Related papers (2023-10-21T18:59:41Z)
Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)
Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices. In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z)
Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL). We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.