Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian
- URL: http://arxiv.org/abs/2407.06011v1
- Date: Mon, 8 Jul 2024 15:04:21 GMT
- Title: Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian
- Authors: Tommaso Mario Buonocore, Simone Rancati, Enea Parimbelli
- Abstract summary: This paper introduces Igea, the first decoder-only language model designed explicitly for biomedical text generation in Italian.
Igea is available in three model sizes: 350 million, 1 billion, and 3 billion parameters.
We evaluate Igea using a mix of in-domain biomedical corpora and general-purpose benchmarks, highlighting its efficacy and retention of general knowledge even after the domain-specific training.
- Score: 0.1474723404975345
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The development of domain-specific language models has significantly advanced natural language processing applications in various specialized fields, particularly in biomedicine. However, the focus has largely been on English-language models, leaving a gap for less-resourced languages such as Italian. This paper introduces Igea, the first decoder-only language model designed explicitly for biomedical text generation in Italian. Built on the Minerva model and continually pretrained on a diverse corpus of Italian medical texts, Igea is available in three model sizes: 350 million, 1 billion, and 3 billion parameters. The models aim to balance computational efficiency and performance, addressing the challenges of managing the peculiarities of medical terminology in Italian. We evaluate Igea using a mix of in-domain biomedical corpora and general-purpose benchmarks, highlighting its efficacy and retention of general knowledge even after the domain-specific training. This paper discusses the model's development and evaluation, providing a foundation for future advancements in Italian biomedical NLP.
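The continual-pretraining recipe described in the abstract corresponds to a standard causal-language-modeling fine-tuning loop over a domain corpus. Below is a minimal, hypothetical sketch using HuggingFace Transformers; the checkpoint id, corpus file, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of continued (domain-adaptive) pretraining of a decoder-only
# causal LM on an Italian biomedical corpus. Checkpoint id, corpus file and
# hyperparameters are placeholders, not the paper's actual setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

base_model = "sapienzanlp/Minerva-350M-base-v1.0"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding in the collator

# Hypothetical plain-text corpus of Italian medical documents, one per line.
raw = load_dataset("text", data_files={"train": "it_biomedical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="igea-cpt",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective used by decoder-only models
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```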
Related papers
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models (a generic adapter sketch appears after this list).
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing requirements low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation [22.986061896641083]
MedEval is a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering granular potential uses of the data.
arXiv Detail & Related papers (2023-10-21T18:59:41Z)
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
- Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models [0.987336898133886]
We present two approaches to derive biomedical language models in languages other than English.
One is based on neural machine translation of English resources, favoring quantity over quality.
The other is based on a high-grade, narrow-scoped corpus written in Italian, thus preferring quality over quantity.
arXiv Detail & Related papers (2022-12-20T16:59:56Z)
- RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining [117.56261821197741]
We present several BERT-based models for Russian language biomedical text mining.
The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain.
arXiv Detail & Related papers (2022-04-08T09:18:59Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
- BioMegatron: Larger Biomedical Domain Language Model [10.861369276414525]
We study and evaluate several factors that can affect performance on domain language applications.
We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus.
arXiv Detail & Related papers (2020-10-12T22:46:10Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- CBAG: Conditional Biomedical Abstract Generation [1.2633386045916442]
We propose a transformer-based conditional language model with a shallow encoder "condition" stack, and a deep "language model" stack of multi-headed attention blocks.
We generate biomedical abstracts given only a proposed title, an intended publication year, and a set of keywords.
arXiv Detail & Related papers (2020-02-13T17:11:33Z)
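The adapter-based knowledge injection summarised in the "Diversifying Knowledge Enhancement" entry above amounts to inserting small trainable bottleneck layers into an otherwise frozen pre-trained model. The sketch below is a generic illustration under that assumption, not the paper's actual implementation; hidden sizes and placement are placeholders.

```python
# Generic bottleneck-adapter sketch (illustrative only): a small trainable
# module added after a frozen transformer sublayer, in the spirit of the
# "lightweight adapter modules" described above. Dimensions are placeholders.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen model's representation;
        # only the adapter weights would be updated during knowledge injection.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage: apply the adapter to hidden states from a frozen PLM layer.
adapter = BottleneckAdapter()
hidden = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
out = adapter(hidden)
print(out.shape)                   # torch.Size([2, 16, 768])
```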