Fast differentiable DNA and protein sequence optimization for molecular
design
- URL: http://arxiv.org/abs/2005.11275v2
- Date: Sun, 20 Dec 2020 22:44:01 GMT
- Title: Fast differentiable DNA and protein sequence optimization for molecular
design
- Authors: Johannes Linder and Georg Seelig
- Abstract summary: Machine learning models that accurately predict biological fitness from sequence are becoming a powerful tool for molecular design.
Here, we build on a previously proposed straight-through approximation method to optimize through discrete sequence samples.
The resulting algorithm, which we call Fast SeqProp, achieves up to 100-fold faster convergence compared to previous versions of activation maximization.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Designing DNA and protein sequences with improved function has the potential
to greatly accelerate synthetic biology. Machine learning models that
accurately predict biological fitness from sequence are becoming a powerful
tool for molecular design. Activation maximization offers a simple design
strategy for differentiable models: one-hot coded sequences are first
approximated by a continuous representation which is then iteratively optimized
with respect to the predictor oracle by gradient ascent. While elegant, this
method suffers from vanishing gradients and may cause predictor pathologies
leading to poor convergence. Here, we build on a previously proposed
straight-through approximation method to optimize through discrete sequence
samples. By normalizing nucleotide logits across positions and introducing an
adaptive entropy variable, we remove bottlenecks arising from overly large or
skewed sampling parameters. The resulting algorithm, which we call Fast
SeqProp, achieves up to 100-fold faster convergence compared to previous
versions of activation maximization and finds improved fitness optima for many
applications. We demonstrate Fast SeqProp by designing DNA and protein
sequences for six deep learning predictors, including a protein structure
predictor.
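The sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the logits are standardized per nucleotide channel across positions, that the "adaptive entropy variable" acts as a single scale `gamma` on the normalized logits, and it omits the straight-through gradient update and any learnable offset. All function and variable names are hypothetical.

```python
import math
import random


def normalize_across_positions(logits, eps=1e-5):
    """Standardize each nucleotide channel over all sequence positions (assumed scheme)."""
    n_pos, n_chan = len(logits), len(logits[0])
    out = [[0.0] * n_chan for _ in range(n_pos)]
    for j in range(n_chan):
        col = [row[j] for row in logits]
        mean = sum(col) / n_pos
        var = sum((v - mean) ** 2 for v in col) / n_pos
        for i in range(n_pos):
            out[i][j] = (logits[i][j] - mean) / math.sqrt(var + eps)
    return out


def softmax(row):
    """Numerically stable softmax over one position's channels."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits, gamma=1.0, rng=None, alphabet="ACGT"):
    """Normalize logits, scale by gamma, then sample one letter per position."""
    rng = rng or random.Random(0)
    normed = normalize_across_positions(logits)
    probs = [softmax([gamma * v for v in row]) for row in normed]
    seq = "".join(rng.choices(alphabet, weights=p, k=1)[0] for p in probs)
    return seq, probs


def mean_entropy(probs):
    """Average per-position entropy (in nats) of the sampling distribution."""
    total = -sum(p * math.log(p) for row in probs for p in row if p > 0)
    return total / len(probs)


# Deliberately large, skewed logits: the kind of parameters the abstract says
# become a bottleneck without normalization.
raw_logits = [
    [50.0, 0.0, 0.0, 0.0],
    [0.0, 60.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 70.0],
    [0.0, 0.0, 40.0, 0.0],
]

seq, probs = sample_sequence(raw_logits, gamma=1.0, rng=random.Random(0))
sharp_seq, sharp_probs = sample_sequence(raw_logits, gamma=5.0, rng=random.Random(0))
print(seq, round(mean_entropy(probs), 3))
print(sharp_seq, round(mean_entropy(sharp_probs), 3))
```

In this sketch, normalization makes the sampling distribution insensitive to the overall magnitude of the raw logits, so the entropy of the samples is controlled by `gamma` alone: a larger `gamma` yields sharper, lower-entropy distributions. In an actual optimizer, the gradient of the predictor score with respect to the sampled one-hot sequence would be passed back through the softmax (the straight-through approximation) to update both the logits and `gamma`.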
Related papers
- Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization [44.356888079704156]
Protein engineering is a daunting task due to the vast sequence space of any given protein.
Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences.
We propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model.
arXiv Detail & Related papers (2024-01-08T06:33:27Z)
- Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
- Protein Sequence Design with Batch Bayesian Optimisation [0.0]
Protein sequence design is a challenging problem in protein engineering, which aims to discover novel proteins with useful biological functions.
Directed evolution is a widely used approach for protein sequence design, which mimics the evolution cycle in a laboratory environment and conducts an iterative protocol.
We propose a new method based on Batch Bayesian Optimization (Batch BO), a well-established optimization method, for protein sequence design.
arXiv Detail & Related papers (2023-03-18T14:53:20Z) - An Empirical Evaluation of Zeroth-Order Optimization Methods on
AI-driven Molecule Optimization [78.36413169647408]
We study the effectiveness of various ZO optimization methods for optimizing molecular objectives.
We show the advantages of ZO sign-based gradient descent (ZO-signGD).
We demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite.
arXiv Detail & Related papers (2022-10-27T01:58:10Z) - Improving RNA Secondary Structure Design using Deep Reinforcement
Learning [69.63971634605797]
We propose a new benchmark of applying reinforcement learning to RNA sequence design, in which the objective function is defined to be the free energy in the sequence's secondary structure.
We show results of the ablation analysis that we do for these algorithms, as well as graphs indicating the algorithm's performance across batches.
arXiv Detail & Related papers (2021-11-05T02:54:06Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization
with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It simultaneously can reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z) - Adaptive machine learning for protein engineering [0.4568777157687961]
We discuss how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement.
First, we discuss how to select sequences through a single round of machine-learning optimization.
Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
arXiv Detail & Related papers (2021-06-10T02:56:35Z) - EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based
Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z) - Combination of digital signal processing and assembled predictive models
facilitates the rational design of proteins [0.0]
Predicting the effect of mutations in proteins is one of the most critical challenges in protein engineering.
We use clustering, embedding, and dimensionality reduction techniques to select combinations of physicochemical properties for the encoding stage.
We then select the best performing predictive models in each set of properties and create an assembled model.
arXiv Detail & Related papers (2020-10-07T16:35:02Z) - AdaLead: A simple and robust adaptive greedy search algorithm for
sequence design [55.41644538483948]
We develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead).
AdaLead is a remarkably strong benchmark that out-competes more complex state of the art approaches in a variety of biologically motivated sequence design challenges.
arXiv Detail & Related papers (2020-10-05T16:40:38Z)