BioAug: Conditional Generation based Data Augmentation for Low-Resource
Biomedical NER
- URL: http://arxiv.org/abs/2305.10647v1
- Date: Thu, 18 May 2023 02:04:38 GMT
- Title: BioAug: Conditional Generation based Data Augmentation for Low-Resource
Biomedical NER
- Authors: Sreyan Ghosh and Utkarsh Tyagi and Sonal Kumar and Dinesh Manocha
- Abstract summary: We present BioAug, a novel data augmentation framework for low-resource BioNER.
BioAug is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation.
We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets.
- Score: 52.79573512427998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biomedical Named Entity Recognition (BioNER) is the fundamental task of
identifying named entities from biomedical text. However, BioNER suffers from
severe data scarcity and lacks high-quality labeled data due to the highly
specialized and expert knowledge required for annotation. Though data
augmentation has been shown to be highly effective for low-resource NER in general,
existing data augmentation techniques fail to produce factual and diverse
augmentations for BioNER. In this paper, we present BioAug, a novel data
augmentation framework for low-resource BioNER. BioAug, built on BART, is
trained to solve a novel text reconstruction task based on selective masking
and knowledge augmentation. After training, we perform conditional generation
and produce diverse augmentations by conditioning BioAug on selectively corrupted
text, similar to the training stage. We demonstrate the effectiveness of BioAug
on 5 benchmark BioNER datasets and show that BioAug outperforms all our
baselines by a significant margin (1.5%-21.5% absolute improvement) and is able
to generate augmentations that are both more factual and diverse. Code:
https://github.com/Sreyan88/BioAug.
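The selective-masking idea described in the abstract can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation (which is at the linked repository): the assumption here is that non-entity tokens are randomly replaced with a mask token while entity mentions are kept intact, producing corrupted inputs from which a seq2seq model such as BART can be trained to reconstruct the original sentence.

```python
import random

def selectively_corrupt(tokens, entity_spans, mask_token="<mask>", p_mask=0.3):
    """Mask a fraction of non-entity tokens while keeping every entity
    mention intact (hypothetical sketch of selective masking for
    reconstruction-style augmentation)."""
    # Positions covered by any entity span [start, end) are never masked.
    entity_positions = {i for start, end in entity_spans for i in range(start, end)}
    return [
        mask_token if i not in entity_positions and random.random() < p_mask else tok
        for i, tok in enumerate(tokens)
    ]

random.seed(7)  # reproducible corruption for the example
tokens = ["Aspirin", "reduces", "the", "risk", "of", "stroke", "."]
# Entity spans (illustrative): "Aspirin" at [0, 1) and "stroke" at [5, 6)
corrupted = selectively_corrupt(tokens, entity_spans=[(0, 1), (5, 6)])
print(" ".join(corrupted))
```

At generation time, the trained model would be conditioned on such corrupted text to produce diverse reconstructions in which the entity mentions, and hence the NER labels, are preserved.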
Related papers
- Augmenting Biomedical Named Entity Recognition with General-domain Resources [47.24727904076347]
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations.
We propose GERBERA, a simple-yet-effective method that utilizes a general-domain NER dataset for training.
We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances.
arXiv Detail & Related papers (2024-06-15T15:28:02Z)
- BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z)
- BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning [77.90250740041411]
This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery.
BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data.
arXiv Detail & Related papers (2024-02-27T12:43:09Z)
- BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations [54.97423244799579]
BioT5 is a pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations.
BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information.
arXiv Detail & Related papers (2023-10-11T07:57:08Z)
- AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning [7.427654811697884]
We present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema.
AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning.
arXiv Detail & Related papers (2022-11-30T12:35:00Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We achieve 44.98%, 38.42%, and 40.76% F1 scores on the BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- BIOS: An Algorithmically Generated Biomedical Knowledge Graph [4.030892610300306]
We introduce the Biomedical Informatics Ontology System (BIOS), the first large-scale publicly available biomedical knowledge graph (BioMedKG) fully generated by machine learning algorithms.
BIOS contains 4.1 million concepts, 7.4 million terms in two languages, and 7.3 million relation triplets.
Results suggest that machine learning-based BioMedKG development is a viable alternative to traditional expert curation.
arXiv Detail & Related papers (2022-03-18T14:09:22Z)
- Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT [9.8215089151757]
We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT).
It is trained on biomedical, PubMed Central, and clinical corpora and fine-tuned for 6 different tasks across 20 benchmark datasets.
It represents a new state of the art in 17 out of 20 benchmark datasets.
arXiv Detail & Related papers (2021-07-09T11:47:13Z)
- BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition [9.05154470433578]
Existing BioNER approaches often neglect the challenges of biomedical text and directly adopt state-of-the-art (SOTA) models.
We propose biomedical ALBERT, an effective domain-specific language model trained on large-scale biomedical corpora.
arXiv Detail & Related papers (2020-09-19T12:58:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.