Evaluation of GPT and BERT-based models on identifying protein-protein
interactions in biomedical text
- URL: http://arxiv.org/abs/2303.17728v2
- Date: Wed, 13 Dec 2023 00:18:46 GMT
- Title: Evaluation of GPT and BERT-based models on identifying protein-protein
interactions in biomedical text
- Authors: Hasin Rehana, Nur Bengisu Çam, Mert Basmaci, Jie Zheng,
Christianah Jemiyo, Yongqun He, Arzucan Özgür, Junguk Hur
- Abstract summary: Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks.
We evaluated the PPI-identification performance of multiple GPT and BERT models using three manually curated gold-standard corpora.
- Score: 1.3923237289777164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting protein-protein interactions (PPIs) is crucial for understanding
genetic mechanisms, disease pathogenesis, and drug design. However, with the
fast-paced growth of biomedical literature, there is a growing need for
automated and accurate extraction of PPIs to facilitate scientific knowledge
discovery. Pre-trained language models, such as generative pre-trained
transformers (GPT) and bidirectional encoder representations from transformers
(BERT), have shown promising results in natural language processing (NLP)
tasks. We evaluated the PPI-identification performance of multiple GPT and
BERT models using three manually curated gold-standard corpora: Learning
Language in Logic (LLL) with 164 PPIs in 77 sentences, Human Protein Reference
Database with 163 PPIs in 145 sentences, and Interaction Extraction Performance
Assessment with 335 PPIs in 486 sentences. BERT-based models achieved the best
overall performance, with BioBERT achieving the highest recall (91.95%) and
F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%).
Interestingly, despite not being explicitly trained for biomedical texts, GPT-4
achieved commendable performance, comparable to the top-performing BERT models.
It achieved a precision of 88.37%, a recall of 85.14%, and an F1-score of
86.49% on the LLL dataset. These results suggest that GPT models can
effectively detect PPIs from text data, offering promising avenues for
application in biomedical literature mining. Further research could explore how
these models might be fine-tuned for even more specialized tasks within the
biomedical domain.
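As a quick reference for the metrics reported above, here is a minimal sketch of precision, recall, and F1-score computation for binary PPI classification; the data layout is illustrative, not the authors' evaluation pipeline.

```python
# Minimal sketch of the precision/recall/F1 metrics reported above, for
# binary PPI classification. The list layout is illustrative only.
def precision_recall_f1(gold: list[int], pred: list[int]) -> tuple[float, float, float]:
    """gold/pred: 1 = interacting protein pair, 0 = non-interacting."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: four candidate protein pairs scored against gold labels.
print(precision_recall_f1(gold=[1, 1, 0, 0], pred=[1, 0, 0, 1]))  # (0.5, 0.5, 0.5)
```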
Related papers
- Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision [7.275932354889042]
We introduce a protein language model tailored to generate protein sequences with distinct properties.
We rank the generated sequences by their perplexity scores and filter out those lying outside the permissible convex hull of proteins.
We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation.
arXiv Detail & Related papers (2024-10-25T00:15:39Z)
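The perplexity-ranking step described in the Peptide-GPT summary above can be sketched as follows. The checkpoint name is a generic placeholder rather than the authors' released protein LM, and the convex-hull filter is omitted.

```python
# Hedged sketch of ranking generated sequences by perplexity, as described
# in the Peptide-GPT summary above. "gpt2" is a stand-in checkpoint, not
# the authors' protein LM; the convex-hull filtering step is omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sequence: str) -> float:
    """exp(mean token negative log-likelihood) under the language model."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over tokens
    return torch.exp(loss).item()

# Lower perplexity = more plausible under the model; keep the top candidates.
candidates = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GGGGGGGGGG", "MALWMRLLPLL"]
ranked = sorted(candidates, key=perplexity)
```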
- BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models, owing to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
- Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3 [52.59930033705221]
We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
arXiv Detail & Related papers (2023-04-26T22:21:33Z)
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
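A rough sketch of the synthetic-data idea from the entry above: prompt an LLM for labeled clinical-style examples and collect the raw outputs for downstream fine-tuning. The model name, prompt, and output format are illustrative assumptions, not the paper's protocol.

```python
# Rough sketch of LLM-based synthetic data generation for clinical text
# mining, in the spirit of the entry above. The model name, prompt, and
# output parsing are illustrative assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

PROMPT = (
    "Write one de-identified clinical-style sentence that mentions a drug "
    "and an adverse reaction, then on a new line list them as "
    "'DRUG: ...' and 'REACTION: ...'."
)

def generate_synthetic_examples(n: int) -> list[str]:
    """Collect n raw LLM outputs to be parsed into (text, label) pairs."""
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # diversity matters more than determinism here
        )
        outputs.append(resp.choices[0].message.content)
    return outputs
```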
- Bioformer: an efficient transformer language model for biomedical text mining [8.961510810015643]
We present Bioformer, a compact BERT model for biomedical text mining.
We pretrained two Bioformer models, reducing the model size by 60% compared to BERT-base.
With 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT.
arXiv Detail & Related papers (2023-02-03T08:04:59Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT [6.1347671366134895]
We train an ensemble of BioBERT models - dubbed PPI-BioBERT-x10 - to improve confidence calibration.
We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracting 1.6 million PTM-PPI predictions (546,507 unique triplets) and filtering these to 5,700 (4,584 unique) high-confidence predictions.
arXiv Detail & Related papers (2022-01-06T19:59:14Z)
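The ensemble-based confidence filtering behind PPI-BioBERT-x10 above can be illustrated by averaging positive-class probabilities across ensemble members and keeping only high-confidence predictions; the threshold here is a placeholder, not the authors' calibrated cut-off.

```python
# Illustrative sketch of ensemble-based confidence filtering in the spirit
# of PPI-BioBERT-x10 above. The threshold is a placeholder, not the
# authors' calibrated cut-off.
import numpy as np

def filter_high_confidence(prob_runs: np.ndarray, threshold: float = 0.95):
    """prob_runs: (n_models, n_examples) array of P(interaction) per model."""
    mean_p = prob_runs.mean(axis=0)   # ensemble probability per example
    spread = prob_runs.std(axis=0)    # disagreement across ensemble members
    keep = mean_p >= threshold        # retain only high-confidence positives
    return mean_p, spread, keep

# Example: three models scoring four candidate PTM-PPI triplets.
probs = np.array([[0.99, 0.40, 0.97, 0.70],
                  [0.98, 0.55, 0.96, 0.60],
                  [0.97, 0.35, 0.99, 0.65]])
_, _, keep = filter_high_confidence(probs)
print(keep)  # [ True False  True False]
```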
- Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that the stabilization techniques we study can substantially improve fine-tuning performance for low-resource biomedical NLP applications.
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
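One widely used stabilization technique for low-resource fine-tuning, layer-wise learning-rate decay, is sketched below as a representative example; it is not asserted to be the exact recipe studied in the paper above, and the hyperparameters are placeholders.

```python
# Hedged sketch of layer-wise learning-rate decay, one common technique for
# stabilizing fine-tuning in low-resource settings. Not asserted to be the
# exact recipe of the paper above; base_lr and decay are placeholders.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def layerwise_lr_groups(model, base_lr: float = 2e-5, decay: float = 0.9):
    """Assign geometrically smaller learning rates to lower encoder layers."""
    layers = model.bert.encoder.layer
    n = len(layers)
    groups = [{"params": layer.parameters(),
               "lr": base_lr * decay ** (n - 1 - i)}  # layer 0 gets smallest
              for i, layer in enumerate(layers)]
    # Classifier head trains at the full base rate (embeddings omitted for brevity).
    groups.append({"params": model.classifier.parameters(), "lr": base_lr})
    return groups

optimizer = torch.optim.AdamW(layerwise_lr_groups(model))
```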
- BioNerFlair: biomedical named entity recognition using flair embedding and sequence tagger [0.0]
We introduce BioNerFlair, a method to train models for biomedical named entity recognition.
With almost the same generic architecture widely used for named entity recognition, BioNerFlair outperforms previous state-of-the-art models.
arXiv Detail & Related papers (2020-11-03T06:46:45Z)
- Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts [2.208694022993555]
Protein-protein interactions (PPI) are critical to the function of proteins in both normal and diseased cells.
Only a small percentage of PPIs captured in protein interaction databases have annotations of function available.
Here, we aim to label the function type of PPIs by extracting relationships described in PubMed abstracts.
arXiv Detail & Related papers (2020-08-20T01:42:28Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)