Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing
- URL: http://arxiv.org/abs/2007.15779v6
- Date: Thu, 16 Sep 2021 21:26:07 GMT
- Title: Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing
- Authors: Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong
Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon
- Abstract summary: We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
- Score: 73.37262264915739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining large neural language models, such as BERT, has led to impressive
gains on many natural language processing (NLP) tasks. However, most
pretraining efforts focus on general domain corpora, such as newswire and Web.
A prevailing assumption is that even domain-specific pretraining can benefit by
starting from general-domain language models. In this paper, we challenge this
assumption by showing that for domains with abundant unlabeled text, such as
biomedicine, pretraining language models from scratch results in substantial
gains over continual pretraining of general-domain language models. To
facilitate this investigation, we compile a comprehensive biomedical NLP
benchmark from publicly-available datasets. Our experiments show that
domain-specific pretraining serves as a solid foundation for a wide range of
biomedical NLP tasks, leading to new state-of-the-art results across the board.
Further, in conducting a thorough evaluation of modeling choices, both for
pretraining and task-specific fine-tuning, we discover that some common
practices are unnecessary with BERT models, such as using complex tagging
schemes in named entity recognition (NER). To help accelerate research in
biomedical NLP, we have released our state-of-the-art pretrained and
task-specific models for the community, and created a leaderboard featuring our
BLURB benchmark (short for Biomedical Language Understanding & Reasoning
Benchmark) at https://aka.ms/BLURB.
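As an illustration of how the released models can be used, the following is a minimal sketch of setting up a pretrained biomedical BERT for NER with a plain BIO tagging scheme, in line with the finding that more complex tagging schemes are unnecessary. The Hugging Face checkpoint ID and the single "Chemical" entity type are illustrative assumptions, not details taken from the paper.
```python
# Minimal sketch (not the authors' exact pipeline): load a released biomedical
# BERT checkpoint for token classification with a plain BIO scheme.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed ID
labels = ["O", "B-Chemical", "I-Chemical"]  # simple BIO tags, no BIOES/IOBES

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Encode one biomedical sentence; the classification head is freshly initialized
# and would be fine-tuned on a BLURB NER dataset before predictions are meaningful.
enc = tokenizer("Aspirin inhibits cyclooxygenase activity.", return_tensors="pt")
pred_ids = model(**enc).logits.argmax(-1)[0].tolist()
print([labels[i] for i in pred_ids])  # one BIO tag per wordpiece
```
A simple linear token-classification head on top of the domain-specific encoder, fine-tuned per task, is roughly the recipe the benchmark evaluates.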
Related papers
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training can match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computational requirements low; a minimal adapter sketch follows this entry.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
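To make the adapter idea concrete, here is a minimal, hedged sketch of a bottleneck adapter of the kind commonly inserted into frozen pretrained encoders; the dimensions and placement are illustrative assumptions, not the configuration used in the paper above.
```python
# Bottleneck adapter sketch: a small down-project / up-project MLP with a
# residual connection, applied to the hidden states of a frozen encoder layer.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen backbone's representation;
        # only the small adapter weights are trained.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: adapt the output of one transformer layer (batch=2, seq=16, dim=768).
adapter = BottleneckAdapter()
layer_output = torch.randn(2, 16, 768)
print(adapter(layer_output).shape)  # torch.Size([2, 16, 768])
```
Because only the adapter parameters are updated, structured knowledge can be injected without retraining the full PLM, which is what keeps the compute requirements low.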
- BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model [1.1764594853212893]
In this work, we introduce the generative language model BioBART that adapts BART to the biomedical domain.
We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition.
BioBART pretrained on PubMed abstracts improves over BART and sets strong baselines on several tasks; a usage sketch follows this entry.
arXiv Detail & Related papers (2022-04-08T08:07:42Z)
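A hedged usage sketch of a biomedical BART-style seq2seq model for generation; the checkpoint name is an assumption about the public BioBART release, and any BART checkpoint follows the same interface.
```python
# Seq2seq generation with a BART-style biomedical model (interface sketch only).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "GanjinZero/biobart-base"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = ("Metformin is a first-line therapy for type 2 diabetes and acts primarily "
        "by reducing hepatic glucose production and improving insulin sensitivity.")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
# Without task-specific fine-tuning the output is not a faithful summary; this
# only demonstrates the encoder-decoder generation API.
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```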
- Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that stabilization techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications; a configuration sketch in that spirit follows this entry.
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
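The sketch below shows generic stabilization settings in the spirit of that study (small learning rate, warmup, longer training, multiple random seeds); the specific values are illustrative assumptions, not the paper's recipe.
```python
# Generic low-resource fine-tuning configuration sketch using Hugging Face Trainer args.
from transformers import TrainingArguments

def stable_finetune_args(output_dir: str, seed: int) -> TrainingArguments:
    return TrainingArguments(
        output_dir=output_dir,
        seed=seed,                        # repeat over several seeds, keep the best dev run
        learning_rate=1e-5,               # smaller LR reduces divergent/failed runs
        warmup_ratio=0.1,                 # linear warmup over the first 10% of updates
        num_train_epochs=10,              # more epochs compensate for tiny training sets
        per_device_train_batch_size=16,
        weight_decay=0.01,
    )

# Example: prepare three runs that differ only in the random seed.
runs = [stable_finetune_args(f"runs/seed{s}", seed=s) for s in (1, 2, 3)]
```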
- Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models, and the results show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
- ELECTRAMed: a new pre-trained language representation model for biomedical NLP [0.0]
We propose a pre-trained domain-specific language model, called ELECTRAMed, suited for the biomedical field.
The novel approach inherits the learning framework of the general-domain ELECTRA architecture, as well as its computational advantages; a sketch of ELECTRA's replaced-token-detection objective follows this entry.
arXiv Detail & Related papers (2021-04-19T19:38:34Z)
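For context, this sketch demonstrates the replaced-token-detection objective that ELECTRA-style models are trained with, using the general-domain google/electra-small-discriminator checkpoint as a stand-in; no assumption is made about the availability of the ELECTRAMed weights under a particular name.
```python
# Replaced-token detection: the discriminator scores each token as original vs. replaced.
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

checkpoint = "google/electra-small-discriminator"  # general-domain stand-in
tokenizer = ElectraTokenizerFast.from_pretrained(checkpoint)
model = ElectraForPreTraining.from_pretrained(checkpoint)

# Original sentence: "the patient was treated with aspirin"; a generator-style
# corruption swaps in "sunshine", and the discriminator flags the replaced position.
corrupted = "the patient was treated with sunshine"
inputs = tokenizer(corrupted, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits[0]  # higher score = more likely replaced
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, (scores > 0).tolist())))
```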
- Pre-training technique to localize medical BERT and enhance biomedical BERT [0.0]
It is difficult to train specific BERT models that perform well for domains in which there are few publicly available databases of high quality and large size.
We propose a single intervention with one option: simultaneous pre-training after up-sampling the in-domain corpus and amplifying the vocabulary.
Our Japanese medical BERT outperformed conventional baselines and the other BERT models in terms of the medical document classification task.
arXiv Detail & Related papers (2020-05-14T18:00:01Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model; a toy sketch of the denoising objective follows this entry.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
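To illustrate the denoising objective, here is a toy sketch of BART/mBART-style text infilling, where random spans collapse to a single mask token and the model is trained to reconstruct the original sequence. The numbers roughly follow the mBART recipe (mask about 35% of words, span lengths from Poisson(λ=3.5)), but treat this as a simplified approximation rather than the actual preprocessing code.
```python
# Simplified text-infilling noise for denoising seq2seq pretraining.
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.35, poisson_lambda=3.5, seed=0):
    """Collapse random spans of tokens into single mask tokens."""
    rng = np.random.default_rng(seed)
    out, i, masked = [], 0, 0
    budget = int(len(tokens) * mask_ratio)   # cap masking at roughly 35% of tokens
    while i < len(tokens):
        if masked < budget and rng.random() < mask_ratio:
            span = max(1, int(rng.poisson(poisson_lambda)))
            out.append(mask_token)           # the whole span becomes one mask token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

src = "the patient was discharged after three days of observation .".split()
print(text_infilling(src))  # corrupted encoder input; the original tokens are the decoder target
```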