SESNet: sequence-structure feature-integrated deep learning method for
data-efficient protein engineering
- URL: http://arxiv.org/abs/2301.00004v1
- Date: Thu, 29 Dec 2022 01:49:52 GMT
- Title: SESNet: sequence-structure feature-integrated deep learning method for
data-efficient protein engineering
- Authors: Mingchen Li, Liqi Kang, Yi Xiong, Yu Guang Wang, Guisheng Fan, Pan
Tan, Liang Hong
- Abstract summary: We develop SESNet, a supervised deep-learning model to predict the fitness of protein mutants.
We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship.
Our model can achieve strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants.
- Score: 6.216757583450049
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning has been widely used for protein engineering. However, it is
limited by the lack of sufficient experimental data to train an accurate model
for predicting the functional fitness of high-order mutants. Here, we develop
SESNet, a supervised deep-learning model that predicts the fitness of protein
mutants by leveraging both sequence and structure information and exploiting
an attention mechanism. Our model integrates the local evolutionary context from
homologous sequences, the global evolutionary context encoding rich semantics
from the universal protein sequence space, and the structure information
accounting for the microenvironment around each residue in a protein. We show
that SESNet outperforms state-of-the-art models for predicting the
sequence-function relationship on 26 deep mutational scanning datasets. More
importantly, we propose a data augmentation strategy that leverages the data
from unsupervised models to pre-train our model. After that, our model can
achieve strikingly high accuracy in predicting the fitness of protein
mutants, especially higher-order variants (> 4 mutation sites), when
fine-tuned on only a small number of experimental mutation data points (<50). The
proposed strategy is of great practical value because the required experimental
effort, i.e., producing a few tens of experimental mutation data points on a given
protein, is generally affordable for an ordinary biochemical group, and the
strategy can be applied to almost any protein.
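The pre-train-then-fine-tune strategy described in the abstract can be illustrated with a small sketch. This is not SESNet: a linear model on one-hot features stands in for the attention-based network, and the "unsupervised model" is a made-up noisy proxy of a synthetic fitness landscape. All names (`one_hot`, `random_mutants`, `fit`, `proxy_w`, `true_w`) and all data are hypothetical, chosen only to show the workflow of pre-training on plentiful pseudo-labels and then fine-tuning on fewer than 50 experimental labels.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flatten a protein sequence into a one-hot feature vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def featurize(seqs):
    return np.stack([one_hot(s) for s in seqs])

def random_mutants(wild_type, n, min_sites, max_sites, rng):
    """Sample n mutants, each with min_sites..max_sites resampled positions."""
    mutants = []
    for _ in range(n):
        seq = list(wild_type)
        k = int(rng.integers(min_sites, max_sites + 1))
        for pos in rng.choice(len(seq), size=k, replace=False):
            # A resampled residue may equal the original one (assumption: acceptable here).
            seq[pos] = AMINO_ACIDS[rng.integers(len(AMINO_ACIDS))]
        mutants.append("".join(seq))
    return mutants

def fit(X, y, w=None, lr=0.05, epochs=300, l2=1e-3):
    """Full-batch gradient descent on squared error; passing w warm-starts training."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y) + l2 * w
        w -= lr * grad
    return w

def spearman(a, b):
    """Rank correlation (ties are not averaged; adequate for this sketch)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(0)
wild_type = "".join(rng.choice(list(AMINO_ACIDS), size=20))
dim = len(wild_type) * len(AMINO_ACIDS)

# Synthetic ground truth and a noisy proxy standing in for an unsupervised scorer.
true_w = rng.normal(size=dim)
proxy_w = true_w + rng.normal(scale=1.0, size=dim)

# Step 1: pre-train on many mutants labelled by the (hypothetical) unsupervised proxy.
pre_seqs = random_mutants(wild_type, 2000, 1, 6, rng)
X_pre = featurize(pre_seqs)
w_pretrained = fit(X_pre, X_pre @ proxy_w)

# Step 2: fine-tune on a small set (<50) of simulated experimental measurements.
exp_seqs = random_mutants(wild_type, 40, 1, 6, rng)
X_exp = featurize(exp_seqs)
y_exp = X_exp @ true_w + rng.normal(scale=0.1, size=len(exp_seqs))
w_finetuned = fit(X_exp, y_exp, w=w_pretrained, epochs=50)
w_scratch = fit(X_exp, y_exp)  # baseline without pre-training

# Step 3: evaluate on held-out high-order variants (5 or 6 resampled sites).
test_seqs = random_mutants(wild_type, 500, 5, 6, rng)
X_test = featurize(test_seqs)
y_test = X_test @ true_w
rho_finetuned = spearman(X_test @ w_finetuned, y_test)
rho_scratch = spearman(X_test @ w_scratch, y_test)
print(f"Spearman, pre-trained + fine-tuned: {rho_finetuned:.3f}")
print(f"Spearman, trained from scratch:     {rho_scratch:.3f}")
```

The only design point carried over from the abstract is the ordering: abundant pseudo-labels first, scarce experimental labels second, with the second stage starting from the first stage's weights. The numbers printed depend entirely on the synthetic setup and say nothing about SESNet's reported performance.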
Related papers
- Training on test proteins improves fitness, structure, and function prediction [18.176929152066872]
Self-supervised pre-training on large datasets is a common method to enhance generalization.
We introduce a method for self-supervised fine-tuning at test time, allowing models to adapt to the test protein of interest on the fly.
We show that our method leads to new state-of-the-art results on the standard benchmark for protein fitness prediction.
arXiv Detail & Related papers (2024-11-04T14:23:59Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL [1.840390797252648]
Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations.
We propose eGRAL, a novel graph neural network architecture designed for predicting binding affinity changes from amino acid substitutions in protein complexes.
eGRAL leverages residue, atomic and evolutionary scales, thanks to features extracted from protein large language models.
arXiv Detail & Related papers (2024-05-03T10:33:19Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- Accurate and Definite Mutational Effect Prediction with Lightweight Equivariant Graph Neural Networks [2.381587712372268]
This research introduces a lightweight graph representation learning scheme that efficiently analyzes the microenvironment of wild-type proteins.
Our solution offers a wide range of benefits that make it an ideal choice for the community.
arXiv Detail & Related papers (2023-04-13T09:51:49Z)
- Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations.
We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models.
By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
arXiv Detail & Related papers (2022-12-20T00:26:23Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)