Bioformer: an efficient transformer language model for biomedical text mining
- URL: http://arxiv.org/abs/2302.01588v1
- Date: Fri, 3 Feb 2023 08:04:59 GMT
- Title: Bioformer: an efficient transformer language model for biomedical text mining
- Authors: Li Fang, Qingyu Chen, Chih-Hsuan Wei, Zhiyong Lu, Kai Wang
- Abstract summary: We present Bioformer, a compact BERT model for biomedical text mining.
We pretrained two Bioformer models which reduced the model size by 60% compared to BERT-Base.
With 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained language models such as Bidirectional Encoder Representations from
Transformers (BERT) have achieved state-of-the-art performance in natural
language processing (NLP) tasks. Recently, BERT has been adapted to the
biomedical domain. Despite their effectiveness, these models have hundreds of
millions of parameters and are computationally expensive when applied to
large-scale NLP applications. We hypothesized that the number of parameters of
the original BERT can be dramatically reduced with minor impact on performance.
In this study, we present Bioformer, a compact BERT model for biomedical text
mining. We pretrained two Bioformer models (named Bioformer8L and Bioformer16L)
which reduced the model size by 60% compared to BERT-Base. Bioformer uses a
biomedical vocabulary and was pre-trained from scratch on PubMed abstracts and
PubMed Central full-text articles. We thoroughly evaluated the performance of
Bioformer as well as existing biomedical BERT models including BioBERT and
PubMedBERT on 15 benchmark datasets of four different biomedical NLP tasks:
named entity recognition, relation extraction, question answering and document
classification. The results show that with 60% fewer parameters, Bioformer16L
is only 0.1% less accurate than PubMedBERT while Bioformer8L is 0.9% less
accurate than PubMedBERT. Both Bioformer16L and Bioformer8L outperformed
BioBERT-Base v1.1. In addition, Bioformer16L and Bioformer8L are 2-3 times as
fast as PubMedBERT and BioBERT-Base v1.1. Bioformer has been successfully deployed
to PubTator Central providing gene annotations over 35 million PubMed abstracts
and 5 million PubMed Central full-text articles. We make Bioformer publicly
available via https://github.com/WGLab/bioformer, including pre-trained models,
datasets, and instructions for downstream use.
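The 60% parameter reduction follows directly from Bioformer's smaller hidden size. As a rough sanity check, a standard BERT-style parameter estimate can be computed from the configuration alone. The Bioformer hyperparameters below (hidden size 384, intermediate size 1536, 16 layers, a ~32.8k wordpiece vocabulary) are assumptions based on the paper's description, not values verified against the released checkpoints:

```python
def bert_param_estimate(layers, hidden, vocab,
                        intermediate=None, max_pos=512, type_vocab=2):
    """Rough parameter count for a BERT-style encoder (pooler head excluded)."""
    if intermediate is None:
        intermediate = 4 * hidden  # BERT's default feed-forward width
    # Embeddings: wordpiece + position + token-type tables, plus one LayerNorm.
    emb = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections (weights + biases).
    attn = 4 * (hidden * hidden + hidden)
    # Feed-forward: up- and down-projections (weights + biases).
    ffn = 2 * hidden * intermediate + intermediate + hidden
    # Two LayerNorms per layer (scale + shift).
    ln = 2 * 2 * hidden
    return emb + layers * (attn + ffn + ln)

bert_base = bert_param_estimate(layers=12, hidden=768, vocab=30522)
bioformer_16l = bert_param_estimate(layers=16, hidden=384, vocab=32768,
                                    intermediate=1536)
print(f"BERT-Base estimate:    {bert_base / 1e6:.1f}M")
print(f"Bioformer16L estimate: {bioformer_16l / 1e6:.1f}M")
print(f"reduction:             {1 - bioformer_16l / bert_base:.0%}")
```

Under these assumptions the estimate lands close to the roughly 60% reduction reported in the abstract; any small gap comes from components the formula ignores, such as the pooler head.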
Related papers
- BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z)
- BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles.
When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with larger models.
BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
arXiv Detail & Related papers (2024-03-27T10:18:21Z)
- BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning [77.90250740041411]
This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery.
BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding; inclusion of extensive bio-text and molecule data from sources such as bioRxiv and PubChem; multi-task instruction tuning for generality across tasks; and a numerical tokenization technique for improved processing of numerical data.
arXiv Detail & Related papers (2024-02-27T12:43:09Z)
- BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER [52.79573512427998]
We present BioAug, a novel data augmentation framework for low-resource BioNER.
BioAug is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation.
We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets.
arXiv Detail & Related papers (2023-05-18T02:04:38Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- On the Effectiveness of Compact Biomedical Transformers [12.432191400869002]
Language models pre-trained on biomedical corpora have recently shown promising results on downstream biomedical tasks.
Many existing pre-trained models are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension, and number of layers.
We introduce six lightweight models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT.
We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1, aiming for efficient lightweight models that perform on par with their larger counterparts.
arXiv Detail & Related papers (2022-09-07T14:24:04Z)
- Multi-label topic classification for COVID-19 literature with Bioformer [5.552371779218602]
We describe Bioformer team's participation in the multi-label topic classification task for COVID-19 literature.
We formulate the topic classification task as a sentence pair classification problem, where the title is the first sentence, and the abstract is the second sentence.
Compared to the baseline results, our best model increased the micro, macro, and instance-based F1 scores by 8.8%, 15.5%, and 7.4%, respectively.
arXiv Detail & Related papers (2022-04-14T05:24:54Z)
- Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT [9.8215089151757]
We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT).
It is trained on biomedical (PubMed and PubMed Central) and clinical corpora and fine-tuned for 6 different tasks across 20 benchmark datasets.
It represents a new state of the art in 17 out of 20 benchmark datasets.
arXiv Detail & Related papers (2021-07-09T11:47:13Z)
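The micro, macro, and instance-based F1 scores cited in the multi-label topic classification entry above measure different things: micro F1 pools true/false positives over all (document, label) decisions, macro F1 averages per-label F1 over labels, and instance-based F1 averages per-document F1. A minimal pure-Python sketch; the label sets below are made-up illustration data, not from the shared task:

```python
def multilabel_f1(y_true, y_pred):
    """Micro, macro, and instance-based F1 for multi-label predictions.

    y_true / y_pred: parallel lists of label sets, one set per document.
    """
    labels = set().union(*y_true, *y_pred)

    # Micro: pool TP/FP/FN over every (document, label) decision.
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    micro = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

    # Macro: F1 per label, then unweighted average over labels.
    def label_f1(lab):
        tp_l = sum(lab in t and lab in p for t, p in zip(y_true, y_pred))
        fp_l = sum(lab not in t and lab in p for t, p in zip(y_true, y_pred))
        fn_l = sum(lab in t and lab not in p for t, p in zip(y_true, y_pred))
        denom = 2 * tp_l + fp_l + fn_l
        return 2 * tp_l / denom if denom else 0.0
    macro = sum(label_f1(lab) for lab in labels) / len(labels)

    # Instance-based: F1 per document, then average over documents.
    def doc_f1(t, p):
        return 2 * len(t & p) / (len(t) + len(p)) if t or p else 1.0
    instance = sum(doc_f1(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)

    return micro, macro, instance

# Toy example with illustrative labels.
y_true = [{"a", "b"}, {"a"}, {"c"}]
y_pred = [{"a"}, {"a", "c"}, {"c"}]
micro, macro, instance = multilabel_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f} instance={instance:.3f}")
# → micro=0.750 macro=0.556 instance=0.778
```

Note how the three metrics diverge even on this tiny example: micro F1 rewards getting the common labels right, macro F1 is dragged down by the never-predicted label "b", and instance-based F1 credits each document's partial match.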
This list is automatically generated from the titles and abstracts of the papers in this site.