BioREx: Improving Biomedical Relation Extraction by Leveraging
Heterogeneous Datasets
- URL: http://arxiv.org/abs/2306.11189v1
- Date: Mon, 19 Jun 2023 22:48:18 GMT
- Authors: Po-Ting Lai, Chih-Hsuan Wei, Ling Luo, Qingyu Chen, Zhiyong Lu
- Abstract summary: Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biomedical relation extraction (RE) is the task of automatically identifying
and characterizing relations between biomedical concepts from free text. RE is
a central task in biomedical natural language processing (NLP) research and
plays a critical role in many downstream applications, such as literature-based
discovery and knowledge graph construction. State-of-the-art methods have
primarily trained machine learning models on individual RE datasets, such as
protein-protein interaction and chemical-induced disease relation. Manual
dataset annotation, however, is highly expensive and time-consuming, as it
requires domain knowledge. Existing RE datasets are usually domain-specific or
small, which limits the development of generalized and high-performing RE
models. In this work, we present a novel framework for systematically
addressing the data heterogeneity of individual datasets and combining them
into a large dataset. Based on the framework and dataset, we report on BioREx,
a data-centric approach for extracting relations. Our evaluation shows that
BioREx achieves significantly higher performance than the benchmark system
trained on the individual dataset, raising the state-of-the-art F1 score on the
recently released BioRED corpus from 74.4% to 79.6%. We further demonstrate that
the combined dataset can improve performance for five different RE tasks. In
addition, we show that on average BioREx compares favorably to current
best-performing methods such as transfer learning and multi-task learning.
Finally, we demonstrate BioREx's robustness and generalizability in two
independent RE tasks not previously seen in training data: drug-drug N-ary
combination and document-level gene-disease RE. The integrated dataset and
optimized method have been packaged as a stand-alone tool available at
https://github.com/ncbi/BioREx.
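As a rough illustration of what "combining heterogeneous datasets" can involve, the sketch below maps two toy datasets with different label vocabularies into one shared record format. It is a minimal sketch under stated assumptions, not the BioREx implementation: the `Example` record, the `LABEL_MAP` table, the shared label names, the dataset tag prefix, and the toy sentences are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One relation-extraction instance in a shared, dataset-independent format."""
    text: str
    head_type: str
    tail_type: str
    label: str


# Hypothetical mapping from each source dataset's labels to one shared label set.
LABEL_MAP = {
    "ppi": {"interaction": "Association", "none": "None"},
    "cdr": {"CID": "Positive_Correlation", "none": "None"},
}


def harmonize(dataset_name, records):
    """Convert one dataset's raw records into shared-format examples."""
    mapping = LABEL_MAP[dataset_name]
    return [
        Example(
            # Prefix a dataset tag so a model can tell the sources apart (an assumption).
            text=f"[{dataset_name}] {rec['text']}",
            head_type=rec["head_type"],
            tail_type=rec["tail_type"],
            label=mapping[rec["label"]],
        )
        for rec in records
    ]


def combine(datasets):
    """Merge several harmonized datasets into a single training list."""
    merged = []
    for name, records in datasets.items():
        merged.extend(harmonize(name, records))
    return merged


# Toy records for illustration only.
toy = {
    "ppi": [
        {"text": "BRCA1 binds BARD1 in vitro.",
         "head_type": "Gene", "tail_type": "Gene", "label": "interaction"},
    ],
    "cdr": [
        {"text": "Cisplatin induced nephrotoxicity in treated patients.",
         "head_type": "Chemical", "tail_type": "Disease", "label": "CID"},
        {"text": "Ibuprofen levels were measured in patients with migraine.",
         "head_type": "Chemical", "tail_type": "Disease", "label": "none"},
    ],
}

merged = combine(toy)
for ex in merged:
    print(ex.label, "|", ex.text)
```

Keeping the label mapping in a single table makes each harmonization decision explicit and easy to review when a new source dataset is added.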
Related papers
- Augmenting Biomedical Named Entity Recognition with General-domain Resources [47.24727904076347]
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations.
We propose GERBERA, a simple yet effective method that utilizes a general-domain NER dataset for training.
We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances.
arXiv Detail & Related papers (2024-06-15T15:28:02Z)
- BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction [2.524192238862961]
Our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy.
The study highlights the potential of automated information extraction in biomedical research and clinical practice.
arXiv Detail & Related papers (2024-05-28T21:34:01Z)
- Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information [5.456047952635665]
This work presents a unified, multi-source PPI corpus with vetted interaction definitions augmented by binary interaction type labels.
A Transformer-based deep learning method exploits entities' relational context information for relation representation to improve relation classification performance.
The model's performance is evaluated on four widely studied biomedical relation extraction datasets.
arXiv Detail & Related papers (2024-03-08T01:43:21Z)
- UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell).
It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains.
In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z)
- Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
arXiv Detail & Related papers (2023-12-15T14:04:23Z)
- Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach [0.0]
Sparsity of labelled data is an obstacle to the development of Relation Extraction models.
We create the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets.
We evaluate the performance of standard fine-tuning as a generative task and few-shot learning with open Large Language Models.
arXiv Detail & Related papers (2023-11-10T19:36:00Z)
- Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
- BioRED: A Comprehensive Biomedical Relation Extraction Dataset [6.915371362219944]
We present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types and relation pairs.
We label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
Our results show that while existing approaches can reach high performance on the NER task, there is much room for improvement for the RE task.
arXiv Detail & Related papers (2022-04-08T19:23:49Z)
- BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines [49.75878234192369]
We present WEAVER, a simple yet efficient post-processing method that infuses old knowledge into the new model.
We show that applying WEAVER in a sequential manner results in similar word embedding distributions as doing a combined training on all data at once.
arXiv Detail & Related papers (2022-02-21T10:34:41Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture that allows us to infuse a deep neural network with human-powered abstraction at the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)