Pre-training technique to localize medical BERT and enhance biomedical BERT
- URL: http://arxiv.org/abs/2005.07202v3
- Date: Thu, 25 Feb 2021 07:00:58 GMT
- Title: Pre-training technique to localize medical BERT and enhance biomedical BERT
- Authors: Shoya Wada, Toshihiro Takeda, Shiro Manabe, Shozo Konishi, Jun Kamohara, and Yasushi Matsumura
- Abstract summary: It is difficult to train specific BERT models that perform well for domains in which there are few publicly available databases of high quality and large size.
We propose a single intervention, simultaneous pre-training after up-sampling, with one option: an amplified vocabulary.
Our Japanese medical BERT outperformed conventional baselines and other BERT models on the medical document classification task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training large-scale neural language models on raw texts has made a
significant contribution to improving transfer learning in natural language
processing (NLP). With the introduction of transformer-based language models,
such as bidirectional encoder representations from transformers (BERT), the
performance of information extraction from free text by NLP has improved
significantly for both the general and medical domains; however, it is
difficult to train specific BERT models that perform well for domains in which
there are few publicly available databases of high quality and large size. We
hypothesized that this problem can be addressed by up-sampling a
domain-specific corpus and using it for pre-training with a larger corpus in a
balanced manner. Our proposed method consists of a single intervention,
simultaneous pre-training after up-sampling, with one option: an amplified vocabulary.
We conducted three experiments and evaluated the resulting products. We
confirmed that our Japanese medical BERT outperformed conventional baselines
and other BERT models on the medical document classification task, and that
our English BERT, pre-trained using both the general and medical-domain
corpora, performed sufficiently well for practical use on the biomedical
language understanding evaluation (BLUE) benchmark. Moreover, our
enhanced biomedical BERT model, in which clinical notes were not used during
pre-training, showed that both the clinical and biomedical scores of the BLUE
benchmark were 0.3 points above those of the ablation model trained without our
proposed method. Well-balanced pre-training by up-sampling instances derived
from a corpus appropriate for the target task allows us to construct a
high-performance BERT model.
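The recipe in the abstract has two pieces: (1) up-sample the small domain-specific corpus so it is balanced against the much larger general corpus and run a single, simultaneous pre-training over the mixture, and (2) optionally amplify the tokenizer vocabulary with domain-specific subwords. The snippet below is a minimal Python sketch of that balancing and vocabulary-merging logic under simplifying assumptions (corpora as lists of raw-text documents, whitespace tokens standing in for WordPieces); the function names are illustrative and are not taken from the paper's code.

```python
import random
from collections import Counter


def upsample_domain_corpus(general_docs, domain_docs, seed=0):
    """Duplicate domain documents until the domain corpus is roughly the
    same size as the general corpus, so that pre-training batches drawn
    from the concatenation are balanced between the two domains."""
    rng = random.Random(seed)
    if not domain_docs:
        return list(general_docs)
    factor = max(1, len(general_docs) // len(domain_docs))
    upsampled = list(domain_docs) * factor
    shortfall = len(general_docs) - len(upsampled)
    if shortfall > 0:
        upsampled += rng.choices(domain_docs, k=shortfall)
    return list(general_docs) + upsampled


def amplify_vocabulary(base_vocab, domain_docs, extra_tokens=1000):
    """Append the most frequent domain terms missing from the base
    vocabulary (a whitespace-token stand-in for training domain WordPieces
    and merging them into the original vocabulary)."""
    base = set(base_vocab)
    counts = Counter(tok for doc in domain_docs for tok in doc.split())
    new_tokens = [t for t, _ in counts.most_common() if t not in base]
    return list(base_vocab) + new_tokens[:extra_tokens]


if __name__ == "__main__":
    general = ["the patient walked home today"] * 8     # toy general corpus
    medical = ["dyspnea and tachycardia on admission"]  # toy medical corpus
    mixed = upsample_domain_corpus(general, medical)
    vocab = amplify_vocabulary(["the", "and", "on"], medical)
    print(len(mixed), vocab)
```

In practice the vocabulary step would use a proper WordPiece trainer on the domain corpus, and the mixed, balanced corpus would then feed an otherwise standard BERT masked-language-model pre-training run; the sketch only illustrates how the two corpora are balanced and how the vocabulary is extended.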
Related papers
- Pre-training data selection for biomedical domain adaptation using journal impact metrics [0.0]
We employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set.
Our results show that pruning using journal impact metrics is not efficient. However, we also show that pre-training on fewer abstracts (with the same number of training steps) does not necessarily decrease the resulting model's performance.
arXiv Detail & Related papers (2024-09-04T13:59:48Z)
- Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
arXiv Detail & Related papers (2023-12-15T14:04:23Z)
- BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition [0.0]
Using language models (LMs) pre-trained in a self-supervised setting on large corpora has helped to deal with the problem of limited labeled data.
Recent research in biomedical language processing has produced a number of pre-trained biomedical LMs.
This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion.
arXiv Detail & Related papers (2023-08-16T18:48:01Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We demonstrate the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identifying code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pre-training settings, especially in low-resource domains.
We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications.
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
- Evaluating Biomedical BERT Models for Vocabulary Alignment at Scale in the UMLS Metathesaurus [8.961270657070942]
The current UMLS (Unified Medical Language System) Metathesaurus construction process is expensive and error-prone.
Recent advances in Natural Language Processing have achieved state-of-the-art (SOTA) performance on downstream tasks.
We aim to validate whether approaches using BERT models can actually outperform existing approaches for predicting synonymy in the UMLS Metathesaurus.
arXiv Detail & Related papers (2021-09-14T16:52:16Z)
- Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results with 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.