Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction
- URL: http://arxiv.org/abs/2405.06729v1
- Date: Fri, 10 May 2024 14:50:40 GMT
- Title: Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction
- Authors: Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young
- Abstract summary: Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants.
We present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS)
These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
- Score: 3.2358123775807575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag experimental accuracy. Here, we present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS) assays using a Normalised Log-odds Ratio (NLR) head. We find consistent improvements in a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
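The core scoring idea behind the paper's approach can be illustrated with a short sketch. The snippet below shows how a variant effect score is commonly derived from a protein language model as a log-odds ratio between the mutant and wild-type amino acid at a position; the toy logits, the `log_odds_ratio` helper, and the 20-letter alphabet indexing are illustrative assumptions, not the paper's actual NLR head or model.

```python
import numpy as np

# Standard 20-amino-acid alphabet; index mapping is an illustrative assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def log_odds_ratio(logits: np.ndarray, wt_aa: str, mut_aa: str) -> float:
    """Log-odds ratio of mutant vs wild-type residue at one position.

    `logits` are a model's raw outputs over the 20 amino acids at that
    position. A positive score means the model favours the mutant residue.
    """
    # Log-softmax: convert logits to log-probabilities.
    log_probs = logits - np.logaddexp.reduce(logits)
    return float(log_probs[AA_INDEX[mut_aa]] - log_probs[AA_INDEX[wt_aa]])

# Toy logits (not from a real PLM): alanine is strongly preferred.
logits = np.zeros(20)
logits[AA_INDEX["A"]] = 2.0
score = log_odds_ratio(logits, wt_aa="W", mut_aa="A")
```

Because a log-softmax is a shared shift of the logits, the score here equals the raw logit difference (2.0); in practice the logits would come from a masked prediction at the mutated position of a real PLM such as ESM.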
Related papers
- A Matrix Variational Auto-Encoder for Variant Effect Prediction in Pharmacogenes [9.316811342279955]
We propose a transformer-based matrix variational auto-encoder (matVAE) with a structured prior. We evaluate its performance on 33 Deep Mutational Scanning (DMS) datasets corresponding to 26 drug target and ADME proteins. Our model trained on MSAs (matVAE-MSA) outperforms the state-of-the-art DeepSequence model in zero-shot prediction on DMS datasets.
arXiv Detail & Related papers (2025-07-03T13:50:18Z) - PLAME: Leveraging Pretrained Language Models to Generate Enhanced Protein Multiple Sequence Alignments [53.55710514466851]
Protein structure prediction is essential for drug discovery and understanding biological functions. Most folding models rely heavily on multiple sequence alignments (MSAs) to boost prediction performance. We propose PLAME, a novel MSA design model that leverages evolutionary embeddings from pretrained protein language models.
arXiv Detail & Related papers (2025-06-17T04:11:30Z) - ProtCLIP: Function-Informed Protein Multi-Modal Learning [18.61302416993122]
We develop ProtCLIP, a multi-modality foundation model that learns function-aware protein embeddings.
Our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks.
The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.
arXiv Detail & Related papers (2024-12-28T04:23:47Z) - Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model [3.4494754789770186]
Deep learning methods for protein modeling have demonstrated superior results at lower costs compared to traditional approaches.
In mutation effect prediction, the key to pre-training deep learning models lies in accurately interpreting the complex relationships among protein sequence, structure, and function.
This study introduces a retrieval-enhanced protein language model for comprehensive analysis of native properties from sequence and local structural interactions.
arXiv Detail & Related papers (2024-10-28T15:28:51Z) - A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models [63.949883238901414]
We present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs.
We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
arXiv Detail & Related papers (2024-08-29T17:46:18Z) - HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction [0.0]
HERMES is a 3D rotationally equivariant structure-based neural network model for mutational effect and stability prediction.
We present a suite of HERMES models, pre-trained with different strategies, and fine-tuned to predict the stability effect of mutations.
arXiv Detail & Related papers (2024-07-09T09:31:05Z) - Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL [1.840390797252648]
Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations.
We propose eGRAL, a novel graph neural network architecture designed for predicting binding affinity changes from amino acid substitutions in protein complexes.
eGRAL leverages residue, atomic and evolutionary scales, thanks to features extracted from protein large language models.
arXiv Detail & Related papers (2024-05-03T10:33:19Z) - Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z) - pLMFPPred: a novel approach for accurate prediction of functional peptides integrating embedding from pre-trained protein language model and imbalanced learning [7.5449239162950965]
pLMFPPred is a tool for predicting functional peptides and identifying toxic peptides.
On a validated independent test set, pLMFPPred achieved accuracy, Area Under the Receiver Operating Characteristic curve (AUROC), and F1-Score values of 0.974, 0.99, and 0.974, respectively.
arXiv Detail & Related papers (2023-09-25T17:57:39Z) - CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration [59.48235003469116]
We show that data augmentation consistently enhances OOD performance.
We also show that CF augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance.
arXiv Detail & Related papers (2023-09-14T16:16:40Z) - Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z) - Using Genetic Programming to Predict and Optimize Protein Function [65.25258357832584]
We propose POET, a computational Genetic Programming tool based on evolutionary methods to enhance screening and mutagenesis in Directed Evolution.
As a proof-of-concept we use peptides that generate MRI contrast detected by the Chemical Exchange Saturation Transfer mechanism.
Our results indicate that a computational modelling tool like POET can help to find peptides with 400% better functionality than those previously used.
arXiv Detail & Related papers (2022-02-08T18:08:08Z) - On the Inference Calibration of Neural Machine Translation [54.48932804996506]
We study the correlation between calibration and translation performance and linguistic properties of miscalibration.
We propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
arXiv Detail & Related papers (2020-05-03T02:03:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.