RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining
- URL: http://arxiv.org/abs/2204.03951v1
- Date: Fri, 8 Apr 2022 09:18:59 GMT
- Title: RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining
- Authors: Alexander Yalunin, Alexander Nesterov, and Dmitriy Umerenkov
- Abstract summary: We present several BERT-based models for Russian language biomedical text mining.
The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents several BERT-based models for Russian language biomedical
text mining (RuBioBERT, RuBioRoBERTa). The models are pre-trained on a corpus
of freely available texts in the Russian biomedical domain. With this
pre-training, our models demonstrate state-of-the-art results on RuMedBench, a
Russian medical language understanding benchmark that covers a diverse set of
tasks, including text classification, question answering, natural language
inference, and named entity recognition.
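Since the released models are standard BERT-family encoders, they can in principle be loaded with the Hugging Face transformers library and fine-tuned for RuMedBench-style tasks. The snippet below is only a minimal sketch under that assumption: the Hub repository id, the label count, and the example sentence are illustrative placeholders, not details taken from the paper.
```python
# Minimal sketch: setting up a RuMedBench-style text classification fine-tune.
# Assumption: the checkpoint is published on the Hugging Face Hub; the repository
# id below is a placeholder and should be replaced with the actual one.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "alexyalunin/RuBioRoBERTa"  # assumed Hub id, not confirmed by the abstract
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Encode a sample Russian clinical sentence ("The patient complains of headache and nausea.")
inputs = tokenizer("Пациент жалуется на головную боль и тошноту.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, num_labels); train on labeled data before use
print(logits)
```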
Related papers
- The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design (arXiv, 2024-08-22)
This paper focuses on research related to embedding models in the Russian language.
It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark.
- Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian (arXiv, 2024-07-08)
This paper introduces Igea, the first decoder-only language model designed explicitly for biomedical text generation in Italian.
Igea is available in three model sizes: 350 million, 1 billion, and 3 billion parameters.
We evaluate Igea using a mix of in-domain biomedical corpora and general-purpose benchmarks, highlighting its efficacy and retention of general knowledge even after the domain-specific training.
- Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus (arXiv, 2024-05-20)
This paper presents the first evaluated biomedical entity linking model for the Dutch language.
We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context.
Our results indicate that biomedical entity linking in a language other than English remains challenging.
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs (arXiv, 2023-12-21)
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computational requirements low (a minimal sketch of such an adapter module follows the list below).
- BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model (arXiv, 2022-04-08)
In this work, we introduce the generative language model BioBART that adapts BART to the biomedical domain.
We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition.
BioBART pretrained on PubMed abstracts achieves improved performance compared to BART and sets strong baselines on several tasks.
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (arXiv, 2021-06-15)
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks, including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the results show that state-of-the-art neural models still perform far below the human ceiling.
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets (arXiv, 2021-04-30)
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
- Conceptualized Representation Learning for Chinese Biomedical Text Mining (arXiv, 2020-08-25)
We investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora.
Adapting to Chinese biomedical text is more difficult due to its complex structure and the variety of phrase combinations.
- A Multilingual Neural Machine Translation Model for Biomedical Data (arXiv, 2020-08-06)
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing (arXiv, 2020-07-31)
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
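As referenced in the adapter-modules entry above, the following is a minimal sketch of a bottleneck adapter layer of the kind commonly inserted into transformer blocks; the dimensions, activation, and wiring are generic assumptions, not the exact configuration used in that paper.
```python
# Minimal sketch of a bottleneck adapter module (generic design, assumed for illustration).
# The adapter projects the hidden state down, applies a non-linearity, projects back up,
# and adds a residual connection; typically only the adapter weights are trained.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the original representation intact when the adapter is untrained.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage: apply the adapter to a batch of transformer hidden states.
adapter = BottleneckAdapter()
hidden = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
print(adapter(hidden).shape)      # torch.Size([2, 16, 768])
```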