Large-scale protein-protein post-translational modification extraction
with distant supervision and confidence calibrated BioBERT
- URL: http://arxiv.org/abs/2201.02229v1
- Date: Thu, 6 Jan 2022 19:59:14 GMT
- Title: Large-scale protein-protein post-translational modification extraction
with distant supervision and confidence calibrated BioBERT
- Authors: Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis and
Karin Verspoor
- Abstract summary: We train an ensemble of BioBERT models - dubbed PPI-BioBERT-x10 - to improve confidence calibration.
We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filtered 5700 (4584 unique) high-confidence predictions.
- Score: 6.1347671366134895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein-protein interactions (PPIs) are critical to normal cellular function
and are related to many disease pathways. However, only 4% of PPIs are
annotated with PTMs in biological knowledge databases such as IntAct; this
annotation is mainly performed through manual curation, which is neither time-
nor cost-effective. We
use the IntAct PPI database to create a distant supervised dataset annotated
with interacting protein pairs, their corresponding PTM type, and associated
abstracts from the PubMed database. We train an ensemble of BioBERT models -
dubbed PPI-BioBERT-x10 - to improve confidence calibration. We extend the
ensemble average confidence approach with a confidence variation measure to
counteract the effects of class imbalance and extract high-confidence
predictions. Evaluated on the test set, the PPI-BioBERT-x10 model achieved a
modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high
confidence with low variation to identify high-quality predictions, and tuning
the predictions for precision, we retained 19% of the test predictions at 100%
precision. We
evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6
million PTM-PPI predictions (546507 unique PTM-PPI triplets), and filtered
~5700 (4584 unique) high-confidence predictions. Of these 5700, human
evaluation of a small randomly sampled subset shows that precision drops to
33.7% despite confidence calibration, highlighting the challenge of
generalising beyond the test set. We circumvent the problem by
only including predictions associated with multiple papers, improving the
precision to 58.8%. In this work, we highlight the benefits and challenges of
deep learning-based text mining in practice, and the need for increased
emphasis on confidence calibration to facilitate human curation efforts.
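The ensemble-confidence filtering described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, ensemble size, and both thresholds (minimum mean confidence, maximum standard deviation) are hypothetical placeholders for whatever values the paper actually tuned.

```python
# Sketch of the "ensemble average confidence with confidence variation" idea:
# an ensemble of models (here 10 hypothetical members) each assigns a
# confidence to the predicted PTM-PPI triplet; the prediction is kept only if
# the mean confidence across members is high AND the variation (standard
# deviation) is low. Thresholds below are illustrative, not from the paper.
from statistics import mean, stdev

def filter_prediction(member_confidences, min_mean=0.9, max_std=0.05):
    """Return (keep, mean_conf, std_conf) for one predicted triplet.

    member_confidences: per-ensemble-member confidence for the winning class.
    """
    m = mean(member_confidences)
    s = stdev(member_confidences)
    return (m >= min_mean and s <= max_std), m, s

# A confident, stable ensemble passes; an unstable or low-mean one is dropped.
keep_a, _, _ = filter_prediction([0.95, 0.96, 0.94, 0.95, 0.96,
                                  0.95, 0.94, 0.96, 0.95, 0.95])
keep_b, _, _ = filter_prediction([0.99, 0.60, 0.98, 0.55, 0.97,
                                  0.99, 0.58, 0.96, 0.97, 0.61])
```

Requiring low variation in addition to a high mean is what counteracts class imbalance here: a majority class can produce high average confidence even when individual ensemble members disagree, and the variation check screens those cases out.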
Related papers
- The Probabilistic Tsetlin Machine: A Novel Approach to Uncertainty Quantification [1.0499611180329802]
This paper introduces the Probabilistic Tsetlin Machine (PTM) framework, aimed at providing a robust, reliable, and interpretable approach for uncertainty quantification.
Unlike the original TM, the PTM learns the probability of staying on each state of each Tsetlin Automaton (TA) across all clauses.
During inference, TAs decide their actions by sampling states based on learned probability distributions.
arXiv Detail & Related papers (2024-10-23T13:20:42Z) - PPINtonus: Early Detection of Parkinson's Disease Using Deep-Learning Tonal Analysis [0.0]
PPINtonus is a system for the early detection of Parkinson's Disease.
It uses deep-learning tonal analysis to provide an alternative to neurological examinations.
arXiv Detail & Related papers (2024-06-03T01:07:42Z) - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z) - Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [55.17761802332469]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample.
Prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs to many applications.
We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z) - Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence
Scores from Language Models Fine-Tuned with Human Feedback [91.22679548111127]
A trustworthy real-world prediction system should produce well-calibrated confidence scores.
We show that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities.
arXiv Detail & Related papers (2023-05-24T10:12:33Z) - Evaluation of GPT and BERT-based models on identifying protein-protein
interactions in biomedical text [1.3923237289777164]
Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks.
We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora.
arXiv Detail & Related papers (2023-03-30T22:06:10Z) - Sample-dependent Adaptive Temperature Scaling for Improved Calibration [95.7477042886242]
A common post-hoc approach to compensate for miscalibrated neural networks is temperature scaling.
We propose to predict a different temperature value for each input, allowing us to adjust the mismatch between confidence and accuracy.
We test our method on the ResNet50 and WideResNet28-10 architectures using the CIFAR10/100 and Tiny-ImageNet datasets.
arXiv Detail & Related papers (2022-07-13T14:13:49Z) - A Supervised Machine Learning Approach for Sequence Based
Protein-protein Interaction (PPI) Prediction [4.916874464940376]
Computational protein-protein interaction (PPI) prediction techniques can contribute greatly in reducing time, cost and false-positive interactions.
We have described our submitted solution with the results of the SeqPIP competition.
arXiv Detail & Related papers (2022-03-23T18:27:25Z) - Improving the robustness and accuracy of biomedical language models
through adversarial training [7.064032374579076]
Deep transformer neural network models have improved the predictive accuracy of intelligent text processing systems in the biomedical domain.
Neural NLP models can be easily fooled by adversarial samples, i.e. minor changes to input that preserve the meaning and understandability of the text but force the NLP system to make erroneous decisions.
This raises serious concerns about the security and trust-worthiness of biomedical NLP systems.
arXiv Detail & Related papers (2021-11-16T14:58:05Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Unlabelled Data Improves Bayesian Uncertainty Calibration under
Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.