Text Mining Drug/Chemical-Protein Interactions using an Ensemble of BERT
and T5 Based Models
- URL: http://arxiv.org/abs/2111.15617v1
- Date: Tue, 30 Nov 2021 18:14:06 GMT
- Title: Text Mining Drug/Chemical-Protein Interactions using an Ensemble of BERT
and T5 Based Models
- Authors: Virginia Adams, Hoo-Chang Shin, Carol Anderson, Bo Liu, Anas Abidin
- Abstract summary: In Track-1 of the BioCreative VII Challenge participants are asked to identify interactions between drugs/chemicals and proteins.
We attempt both a BERT-based sentence classification approach, and a more novel text-to-text approach using a T5 model.
- Score: 3.7462395049372894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Track-1 of the BioCreative VII Challenge participants are asked to
identify interactions between drugs/chemicals and proteins. In-context named
entity annotations for each drug/chemical and protein are provided and one of
fourteen different interactions must be automatically predicted. For this
relation extraction task, we attempt both a BERT-based sentence classification
approach, and a more novel text-to-text approach using a T5 model. We find that
larger BERT-based models perform better in general, with our BioMegatron-based
model achieving the highest scores across all metrics, including an F1 score of 0.74.
Though our novel T5 text-to-text method did not perform as well as most of our
BERT-based models, it outperformed those trained on similar data, showing
promising results with an F1 score of 0.65. We believe a text-to-text approach
to relation extraction has some competitive advantages and leaves substantial
room for further research.
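The abstract contrasts two formulations of the same relation-extraction task but does not reproduce the implementation. Below is a minimal sketch, using the Hugging Face transformers API, of how a BERT sentence-classification setup and a T5 text-to-text setup are typically wired; the checkpoint names ("bert-base-cased", "t5-base"), the entity-marker strings, and the prompt wording are illustrative assumptions rather than the authors' actual configuration.

```python
# Minimal sketch, not the authors' released code: checkpoint names, entity
# markers, prompt wording, and label handling are assumptions.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          T5ForConditionalGeneration, T5Tokenizer)

sentence = "Aspirin inhibits COX-1 activity in platelets."
chem, prot = "Aspirin", "COX-1"

# (1) BERT-style sentence classification: wrap the annotated chemical and
# protein mentions in marker strings and predict one of the fourteen
# DrugProt interaction types with a classification head.
marked = sentence.replace(chem, f"<< {chem} >>").replace(prot, f"[[ {prot} ]]")
bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=14)  # 14 candidate interaction types
logits = bert(**bert_tok(marked, return_tensors="pt")).logits
predicted_label_id = logits.argmax(dim=-1).item()

# (2) T5 text-to-text formulation: phrase the pair as a prompt and let the
# model generate the relation name as free text instead of a class index.
t5_tok = T5Tokenizer.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
prompt = (f"drugprot relation: what is the relation between {chem} and "
          f"{prot}? context: {sentence}")
out = t5.generate(**t5_tok(prompt, return_tensors="pt"), max_new_tokens=8)
predicted_relation = t5_tok.decode(out[0], skip_special_tokens=True)
```

The practical difference is that the classifier commits to a fixed fourteen-way output head, whereas the text-to-text model decodes the relation name as a string, which makes it easier to reuse the same model across label sets.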
Related papers
- Enhancing Authorship Attribution through Embedding Fusion: A Novel Approach with Masked and Encoder-Decoder Language Models [0.0]
We propose a novel framework with textual embeddings from Pre-trained Language Models to distinguish AI-generated and human-authored text.
Our approach utilizes Embedding Fusion to integrate semantic information from multiple Language Models, harnessing their complementary strengths to enhance performance.
arXiv Detail & Related papers (2024-11-01T07:18:27Z)
- Using LLMs to label medical papers according to the CIViC evidence model [0.0]
We introduce the sequence classification problem CIViC Evidence to the field of medical NLP.
We fine-tune pretrained checkpoints of BERT and RoBERTa on the CIViC Evidence dataset.
We compare the aforementioned BERT-like models to OpenAI's GPT-4 in a few-shot setting.
arXiv Detail & Related papers (2024-07-05T12:30:01Z)
- Multi-objective Representation for Numbers in Clinical Narratives Using CamemBERT-bio [0.9208007322096533]
This research aims to classify numerical values extracted from medical documents across seven physiological categories.
We introduce two main innovations: integrating keyword embeddings into the model and adopting a number-agnostic strategy.
We show substantial improvements in the effectiveness of CamemBERT-bio, surpassing conventional methods with an F1 score of 0.89.
arXiv Detail & Related papers (2024-05-28T01:15:21Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z)
- Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing [56.232873134174056]
One of the major challenges in text-to-SQL parsing is domain generalization, i.e., how to generalize well to unseen databases.
In this work, we explore ways to further augment the pre-trained text-to-text transformer model with specialized components for text-to-SQL parsing.
To this end, we propose a new architecture, GRAPHIX-T5, augmented with specially-designed graph-aware layers.
arXiv Detail & Related papers (2023-01-18T13:29:05Z)
- Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
- Text Embeddings by Weakly-Supervised Contrastive Pre-training [98.31785569325402]
E5 is a family of state-of-the-art text embeddings that transfer well to a wide range of tasks.
E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts.
arXiv Detail & Related papers (2022-12-07T09:25:54Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- CU-UD: text-mining drug and chemical-protein interactions with ensembles of BERT-based models [12.08949974675794]
BioCreative VII track 1 DrugProt task aims to promote the development and evaluation of systems that can automatically detect relations between chemical compounds/drugs and genes/proteins in PubMed abstracts.
We describe our submission, which is an ensemble system, including multiple BERT-based language models.
Our system obtained 0.7708 in precision and 0.7770 in recall, for an F1 score of 0.7739, demonstrating the effectiveness of ensembles of BERT-based language models for automatically detecting relations between chemicals and proteins (a minimal ensembling sketch follows this list).
arXiv Detail & Related papers (2021-11-11T13:55:21Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
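Both the paper above and the CU-UD system listed earlier combine several fine-tuned models into an ensemble, but neither abstract specifies the combination rule. The sketch below assumes simple majority voting over per-example relation labels; the function name, tie-breaking behaviour, and example labels (taken from the DrugProt relation types) are illustrative.

```python
# Illustrative majority-vote ensembling sketch; the combination rule used by
# the systems above is an assumption, not taken from either abstract.
from collections import Counter
from typing import List

def ensemble_vote(per_model_labels: List[List[str]]) -> List[str]:
    """Combine per-example relation labels from several models.

    per_model_labels[m][i] is model m's predicted label for example i.
    Ties are broken by whichever label Counter encounters first (arbitrary).
    """
    n_examples = len(per_model_labels[0])
    combined = []
    for i in range(n_examples):
        votes = Counter(model[i] for model in per_model_labels)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Example: three models voting on two chemical-protein pairs.
print(ensemble_vote([
    ["INHIBITOR", "AGONIST"],
    ["INHIBITOR", "ANTAGONIST"],
    ["SUBSTRATE", "ANTAGONIST"],
]))  # -> ['INHIBITOR', 'ANTAGONIST']
```

Weighted voting or averaging of class probabilities are equally plausible alternatives when the member models expose scores rather than hard labels.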
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.