Multi-label topic classification for COVID-19 literature with Bioformer
- URL: http://arxiv.org/abs/2204.06758v1
- Date: Thu, 14 Apr 2022 05:24:54 GMT
- Title: Multi-label topic classification for COVID-19 literature with Bioformer
- Authors: Li Fang, Kai Wang
- Abstract summary: We describe Bioformer team's participation in the multi-label topic classification task for COVID-19 literature.
We formulate the topic classification task as a sentence pair classification problem, where the title is the first sentence, and the abstract is the second sentence.
Compared to the baseline results, our best model increased the micro, macro, and instance-based F1 scores by 8.8%, 15.5%, and 7.4%, respectively.
- Score: 5.552371779218602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe the Bioformer team's participation in the multi-label topic
classification task for COVID-19 literature (track 5 of BioCreative VII). Topic
classification is performed using different BERT models (BioBERT, PubMedBERT,
and Bioformer). We formulate the topic classification task as a sentence pair
classification problem, where the title is the first sentence, and the abstract
is the second sentence. Our results show that Bioformer outperforms BioBERT and
PubMedBERT in this task. Compared to the baseline results, our best model
increased the micro, macro, and instance-based F1 scores by 8.8%, 15.5%, and
7.4%, respectively. Bioformer achieved the highest micro F1 and macro F1 scores
in this challenge. In post-challenge experiments, we found that further
pretraining Bioformer on COVID-19 articles improves performance.
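
The sentence-pair formulation described in the abstract maps directly onto standard BERT fine-tuning code. The following is a minimal sketch, assuming the Hugging Face transformers API; the checkpoint identifier, the LitCovid label set, and the 0.5 decision threshold are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the sentence-pair, multi-label formulation (assumptions:
# Hugging Face transformers API; checkpoint id and threshold are illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The seven LitCovid topics (as listed by the track organizers).
LABELS = ["Treatment", "Diagnosis", "Prevention", "Mechanism",
          "Transmission", "Epidemic Forecasting", "Case Report"]

MODEL_ID = "bioformers/bioformer-8L"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE-with-logits loss when training
)

title = "Remdesivir for the treatment of COVID-19"                # sentence 1
abstract = "We evaluate remdesivir in hospitalized patients ..."  # sentence 2

# Encode title and abstract as a sentence pair, as in the paper's formulation.
inputs = tokenizer(title, abstract, truncation=True,
                   max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)

# Each topic gets an independent sigmoid probability; 0.5 is a common cutoff.
probs = torch.sigmoid(logits)[0]
predicted = [label for label, p in zip(LABELS, probs.tolist()) if p > 0.5]
print(predicted)
```

The `problem_type="multi_label_classification"` setting makes the model use a per-label binary cross-entropy loss during fine-tuning, which is what lets one article carry several topics at once.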
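The three F1 variants quoted above can be reproduced with scikit-learn, where instance-based F1 corresponds to `average="samples"`. A small sketch on toy data:

```python
# Toy illustration of micro, macro, and instance-based F1 for multi-label
# predictions (scikit-learn; the matrices below are made-up data).
import numpy as np
from sklearn.metrics import f1_score

# Rows are articles, columns are topics; 1 marks an assigned topic.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])

print("micro F1:         ", f1_score(y_true, y_pred, average="micro"))
print("macro F1:         ", f1_score(y_true, y_pred, average="macro"))
print("instance-based F1:", f1_score(y_true, y_pred, average="samples"))
```

Micro F1 pools true/false positives across all labels, macro F1 averages the per-label F1 scores, and instance-based (samples) F1 averages an F1 computed per article.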
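The post-challenge finding that continued pretraining on COVID-19 articles helps corresponds to running the standard masked-language-model objective on in-domain text before fine-tuning. A hedged sketch, again assuming the Hugging Face transformers and datasets APIs; the file path and hyperparameters are placeholders, not the authors' settings:

```python
# Continued MLM pretraining on COVID-19 abstracts (sketch; the data file,
# checkpoint id, and hyperparameters below are illustrative placeholders).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL_ID = "bioformers/bioformer-8L"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# One abstract per line in a plain-text file (hypothetical path).
dataset = load_dataset("text", data_files={"train": "covid19_abstracts.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT pretraining objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bioformer-covid",
                           num_train_epochs=1,
                           per_device_train_batch_size=32,
                           learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the adapted checkpoint is then fine-tuned as shown above
```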
Related papers
- Augmenting Biomedical Named Entity Recognition with General-domain Resources [47.24727904076347]
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations.
We propose GERBERA, a simple-yet-effective method that utilizes a general-domain NER dataset for training.
We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances.
arXiv Detail & Related papers (2024-06-15T15:28:02Z)
- BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles.
When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with larger models.
BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
arXiv Detail & Related papers (2024-03-27T10:18:21Z)
- BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning [77.90250740041411]
This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery.
BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data.
arXiv Detail & Related papers (2024-02-27T12:43:09Z)
- BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER [52.79573512427998]
We present BioAug, a novel data augmentation framework for low-resource BioNER.
BioAug is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation.
We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets.
arXiv Detail & Related papers (2023-05-18T02:04:38Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- Bioformer: an efficient transformer language model for biomedical text mining [8.961510810015643]
We present Bioformer, a compact BERT model for biomedical text mining.
We pretrained two Bioformer models that reduce the model size by 60% compared to BERT-Base.
With 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT.
arXiv Detail & Related papers (2023-02-03T08:04:59Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
BioGPT achieves 44.98%, 38.42%, and 40.76% F1 scores on the BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations [13.043042862575192]
The BioCreative LitCovid track calls for a community effort to tackle automated topic annotation for COVID-19 literature.
The dataset consists of over 30,000 articles with manually reviewed topics.
The highest-performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro, micro, and instance-based F1 scores, respectively.
arXiv Detail & Related papers (2022-04-20T20:47:55Z)
- BagBERT: BERT-based bagging-stacking for multi-topic classification [0.0]
We propose an approach that exploits the knowledge of globally non-optimal weights, which are usually rejected, to build a rich representation of each label.
Aggregating these weak insights performs better than a classical, globally efficient model.
Our system obtains an Instance-based F1 of 92.96 and a Label-based micro-F1 of 91.35.
arXiv Detail & Related papers (2021-11-10T17:00:36Z)
- BioNerFlair: biomedical named entity recognition using flair embedding and sequence tagger [0.0]
We introduce BioNerFlair, a method to train models for biomedical named entity recognition.
With almost the same generic architecture widely used for named entity recognition, BioNerFlair outperforms previous state-of-the-art models.
arXiv Detail & Related papers (2020-11-03T06:46:45Z)