End-to-End Models for Chemical-Protein Interaction Extraction: Better
Tokenization and Span-Based Pipeline Strategies
- URL: http://arxiv.org/abs/2304.01344v1
- Date: Mon, 3 Apr 2023 20:20:22 GMT
- Title: End-to-End Models for Chemical-Protein Interaction Extraction: Better
Tokenization and Span-Based Pipeline Strategies
- Authors: Xuguang Ai and Ramakanth Kavuluru
- Abstract summary: We employ a span-based pipeline approach to produce a new state-of-the-art E2ERE performance on the ChemProt dataset.
Our results indicate that a straightforward fine-grained tokenization scheme helps span-based approaches excel in E2ERE.
- Score: 1.782718930156674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end relation extraction (E2ERE) is an important task in information
extraction, more so for biomedicine as scientific literature continues to grow
exponentially. E2ERE typically involves identifying entities (or named entity
recognition (NER)) and associated relations, while most RE tasks simply assume
that the entities are provided upfront and end up performing relation
classification. E2ERE is inherently more difficult than RE alone given the
potential snowball effect of errors from NER leading to more errors in RE. A
complex dataset in biomedical E2ERE is the ChemProt dataset (BioCreative VI,
2017) that identifies relations between chemical compounds and genes/proteins
in scientific literature. ChemProt is included in all recent biomedical natural
language processing benchmarks including BLUE, BLURB, and BigBio. However, its
treatment in these benchmarks and in other separate efforts is typically not
end-to-end, with few exceptions. In this effort, we employ a span-based
pipeline approach to produce a new state-of-the-art E2ERE performance on the
ChemProt dataset, resulting in $> 4\%$ improvement in F1-score over the prior
best effort. Our results indicate that a straightforward fine-grained
tokenization scheme helps span-based approaches excel in E2ERE, especially with
regard to handling complex named entities. Our error analysis also identifies
a few key failure modes in E2ERE for ChemProt.
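To make the tokenization point above concrete, here is a minimal, hypothetical sketch (not the authors' released code or their exact splitting rules) contrasting coarse whitespace tokenization with a fine-grained scheme that splits at punctuation and letter/digit boundaries, followed by the candidate-span enumeration that span-based NER models typically score. With coarse tokens, the boundary of a complex chemical name can fall inside a single token and can never be proposed as a span; fine-grained tokens expose those boundaries to the span enumerator.

```python
import re

def coarse_tokenize(text):
    """Whitespace tokenization: complex chemical names stay as single tokens."""
    return text.split()

def fine_tokenize(text):
    """Hypothetical fine-grained scheme: split into letter runs, digit runs,
    and individual punctuation marks so sub-parts of chemical names surface
    as their own tokens."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

def enumerate_spans(tokens, max_width=8):
    """Candidate (start, end) spans, end inclusive, that a span-based NER
    model would score; entity boundaries must coincide with token boundaries."""
    return [(i, j) for i in range(len(tokens))
            for j in range(i, min(i + max_width, len(tokens)))]

sentence = "Selective inhibition of COX-2 by 3-methyl-1H-indole derivatives"
print(coarse_tokenize(sentence))
# ['Selective', 'inhibition', 'of', 'COX-2', 'by', '3-methyl-1H-indole', 'derivatives']
print(fine_tokenize(sentence))
# ['Selective', 'inhibition', 'of', 'COX', '-', '2', 'by',
#  '3', '-', 'methyl', '-', '1', 'H', '-', 'indole', 'derivatives']
print(len(enumerate_spans(fine_tokenize(sentence))))  # number of candidate spans
```

The trade-off is a larger set of candidate spans per sentence, which the span scorer then has to handle.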
Related papers
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- Biomedical Entity Linking as Multiple Choice Question Answering [48.74212158495695]
We present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering.
BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity.
To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and augment the generator's input with these retrieved instances.
arXiv Detail & Related papers (2024-02-23T08:40:38Z)
- Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
- Chem-FINESE: Validating Fine-Grained Few-shot Entity Extraction through Text Reconstruction [68.76468780148734]
Fine-grained few-shot entity extraction in the chemical domain faces two unique challenges.
Chem-FINESE has two components: a seq2seq entity extractor and a seq2seq self-validation module.
The newly proposed framework contributes up to 8.26% and 6.84% absolute F1-score gains, respectively.
arXiv Detail & Related papers (2024-01-18T18:20:15Z)
- Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case [2.9013777655907056]
End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine.
We compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases.
We find that pipeline models are still the best, while sequence-to-sequence models are not far behind.
arXiv Detail & Related papers (2023-11-22T22:52:00Z)
- Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach [0.0]
Sparsity of labelled data is an obstacle to the development of Relation Extraction models.
We create the first curated evaluation dataset and extract literature items from the LOTUS database to build training sets.
We evaluate the performance of standard fine-tuning as a generative task and few-shot learning with open Large Language Models.
arXiv Detail & Related papers (2023-11-10T19:36:00Z)
- BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z)
- BioRED: A Comprehensive Biomedical Relation Extraction Dataset [6.915371362219944]
We present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types and relation pairs.
We label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
Our results show that while existing approaches can reach high performance on the NER task, there is much room for improvement for the RE task.
arXiv Detail & Related papers (2022-04-08T19:23:49Z)
- Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The Interaction between Drugs and Targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from literature becomes an urgent demand in the industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z)
- Federated Learning of Molecular Properties in a Heterogeneous Setting [79.00211946597845]
We introduce federated heterogeneous molecular learning to address these challenges.
Federated learning allows end-users to build a global model collaboratively while preserving the training data distributed over isolated clients.
FedChem should enable a new type of collaboration for improving AI in chemistry that mitigates concerns about valuable chemical data.
arXiv Detail & Related papers (2021-09-15T12:49:13Z)
- BERT-GT: Cross-sentence n-ary relation extraction with BERT and Graph Transformer [7.262905275276971]
We propose a novel architecture that combines Bidirectional Encoder Representations from Transformers with a Graph Transformer (BERT-GT).
Unlike the original Transformer architecture, which uses the whole sentence(s) to calculate attention for the current token, the neighbor-attention mechanism in our method computes a token's attention using only its neighbor tokens (a minimal illustration follows this entry).
Our results show improvements of 5.44% and 3.89% in accuracy and F1-measure over the state of the art on the n-ary and chemical-protein relation datasets.
arXiv Detail & Related papers (2021-01-11T19:34:55Z)
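The neighbor-attention idea can be pictured with an attention mask. The sketch below is a minimal illustration, not the BERT-GT implementation: it restricts each token to attend only to positions within a fixed window, whereas in BERT-GT the neighbor set would come from the graph used by the Graph Transformer; the fixed window here is purely an assumption for demonstration.

```python
import numpy as np

def neighbor_attention(Q, K, V, window=2):
    """Scaled dot-product attention in which token i attends only to tokens j
    with |i - j| <= window (a simple stand-in for 'neighbor' tokens)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                            # (n, n) attention logits
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window     # True where j is a neighbor of i
    scores = np.where(mask, scores, -1e9)                     # block non-neighbor positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over allowed positions
    return weights @ V                                         # contextualized token vectors

# Toy usage: 6 tokens with 4-dimensional representations, used as self-attention.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
print(neighbor_attention(X, X, X, window=1).shape)  # (6, 4)
```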