Improving few-shot learning-based protein engineering with evolutionary
sampling
- URL: http://arxiv.org/abs/2305.15441v2
- Date: Fri, 26 May 2023 00:53:37 GMT
- Title: Improving few-shot learning-based protein engineering with evolutionary
sampling
- Authors: M. Zaki Jawaid and Robin W. Yeo and Aayushma Gautam and T. Blair
Gainous and Daniel O. Hart and Timothy P. Daley
- Abstract summary: We propose a few-shot learning approach to novel protein design that aims to accelerate the expensive wet lab testing cycle.
Our approach is composed of two parts: a semi-supervised transfer learning approach to generate a discrete fitness landscape for a desired protein function and a novel evolutionary Monte Carlo Chain sampling algorithm.
We demonstrate the performance of our approach by experimentally screening predicted high fitness gene activators, resulting in a dramatically improved hit rate compared to existing methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Designing novel functional proteins remains a slow and expensive process due
to a variety of protein engineering challenges; in particular, the number of
protein variants that can be experimentally tested in a given assay pales in
comparison to the vastness of the overall sequence space, resulting in low hit
rates and expensive wet lab testing cycles. In this paper, we propose a
few-shot learning approach to novel protein design that aims to accelerate the
expensive wet lab testing cycle and is capable of leveraging a training dataset
that is both small and skewed ($\approx 10^5$ datapoints, $< 1\%$ positive
hits). Our approach is composed of two parts: a semi-supervised transfer
learning approach to generate a discrete fitness landscape for a desired
protein function and a novel evolutionary Monte Carlo Markov Chain sampling
algorithm to more efficiently explore the fitness landscape. We demonstrate the
performance of our approach by experimentally screening predicted high fitness
gene activators, resulting in a dramatically improved hit rate compared to
existing methods. Our method can be easily adapted to other protein engineering
and design problems, particularly where the cost associated with obtaining
labeled data is significantly high. We have provided open source code for our
method at https://
github.com/SuperSecretBioTech/evolutionary_monte_carlo_search.
Related papers
- Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space [13.228932754390748]
We propose LatProtRL, an optimization method to efficiently traverse a latent space learned by an encoder-decoder leveraging a large protein language model.
To escape local optima, our optimization is modeled as a Markov decision process using reinforcement learning acting directly in latent space.
Our findings and in vitro evaluation show that the generated sequences can reach high-fitness regions, suggesting a substantial potential of LatProtRL in lab-in-the-loop scenarios.
arXiv Detail & Related papers (2024-05-29T11:03:42Z) - NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bonds, andionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z) - Enhancing Protein Predictive Models via Proteins Data Augmentation: A
Benchmark and New Directions [58.819567030843025]
This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA)
arXiv Detail & Related papers (2024-03-01T07:58:29Z) - Efficiently Predicting Protein Stability Changes Upon Single-point
Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z) - A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z) - Accurate and Definite Mutational Effect Prediction with Lightweight
Equivariant Graph Neural Networks [2.381587712372268]
This research introduces a lightweight graph representation learning scheme that efficiently analyzes the microenvironment of wild-type proteins.
Our solution offers a wide range of benefits that make it an ideal choice for the community.
arXiv Detail & Related papers (2023-04-13T09:51:49Z) - Protein Sequence Design with Batch Bayesian Optimisation [0.0]
Protein sequence design is a challenging problem in protein engineering, which aims to discover novel proteins with useful biological functions.
directed evolution is a widely-used approach for protein sequence design, which mimics the evolution cycle in a laboratory environment and conducts an iterative protocol.
We propose a new method based on Batch Bayesian Optimization (Batch BO), a well-established optimization method, for protein sequence design.
arXiv Detail & Related papers (2023-03-18T14:53:20Z) - Plug & Play Directed Evolution of Proteins with Gradient-based Discrete
MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations.
We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models.
By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
arXiv Detail & Related papers (2022-12-20T00:26:23Z) - Learning Geometrically Disentangled Representations of Protein Folding
Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution [18.726398852721204]
We propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution.
ODBO employs a combination of novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection.
We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest.
arXiv Detail & Related papers (2022-05-19T13:21:31Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.