Steering Generative Models with Experimental Data for Protein Fitness Optimization
- URL: http://arxiv.org/abs/2505.15093v1
- Date: Wed, 21 May 2025 04:30:48 GMT
- Title: Steering Generative Models with Experimental Data for Protein Fitness Optimization
- Authors: Jason Yang, Wenda Chu, Daniel Khalil, Raul Astudillo, Bruce J. Wittmann, Frances H. Arnold, Yisong Yue,
- Abstract summary: Protein fitness optimization involves finding a sequence that maximizes desired quantitative properties in a large design space of possible sequences.<n>Recent developments in steering protein generative models (e.g. diffusion models, language models) offer a promising approach.<n>We show that plug-and-play guidance strategies offer advantages compared to alternatives such as reinforcement learning with protein language models.
- Score: 22.131533900376457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent developments in steering protein generative models (e.g diffusion models, language models) offer a promising approach. However, by and large, past studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured by low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages compared to alternatives such as reinforcement learning with protein language models.
Related papers
- Preference-Guided Diffusion for Multi-Objective Offline Optimization [64.08326521234228]
We propose a preference-guided diffusion model for offline multi-objective optimization.<n>Our guidance is a preference model trained to predict the probability that one design dominates another.<n>Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions.
arXiv Detail & Related papers (2025-03-21T16:49:38Z) - A Variational Perspective on Generative Protein Fitness Optimization [14.726139539370307]
We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization.<n>Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution.<n>VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity.
arXiv Detail & Related papers (2025-01-31T15:07:26Z) - Large Language Model is Secretly a Protein Sequence Optimizer [24.55348363931866]
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence.<n>We demonstrate large language models (LLMs), despite being trained on massive texts, are secretly protein sequences.
arXiv Detail & Related papers (2025-01-16T03:44:16Z) - Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization.
We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z) - Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z) - Robust Model-Based Optimization for Challenging Fitness Landscapes [96.63655543085258]
Protein design involves optimization on a fitness landscape.
Leading methods are challenged by sparsity of high-fitness samples in the training set.
We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools.
We propose a new approach that uses a novel VAE as its search model to overcome the problem.
arXiv Detail & Related papers (2023-05-23T03:47:32Z) - Designing Biological Sequences via Meta-Reinforcement Learning and
Bayesian Optimization [68.28697120944116]
We train an autoregressive generative model via Meta-Reinforcement Learning to propose promising sequences for selection.
We pose this problem as that of finding an optimal policy over a distribution of MDPs induced by sampling subsets of the data.
Our in-silico experiments show that meta-learning over such ensembles provides robustness against reward misspecification and achieves competitive results.
arXiv Detail & Related papers (2022-09-13T18:37:27Z) - ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution [18.726398852721204]
We propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution.
ODBO employs a combination of novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection.
We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest.
arXiv Detail & Related papers (2022-05-19T13:21:31Z) - Guided Generative Protein Design using Regularized Transformers [5.425399390255931]
We introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which is trained to jointly generate sequences and predict fitness.
We explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods.
arXiv Detail & Related papers (2022-01-24T20:55:53Z) - Adaptive machine learning for protein engineering [0.4568777157687961]
We discuss how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement.
First, we discuss how to select sequences through a single round of machine-learning optimization.
Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
arXiv Detail & Related papers (2021-06-10T02:56:35Z) - AdaLead: A simple and robust adaptive greedy search algorithm for
sequence design [55.41644538483948]
We develop an easy-to-directed, scalable, and robust evolutionary greedy algorithm (AdaLead)
AdaLead is a remarkably strong benchmark that out-competes more complex state of the art approaches in a variety of biologically motivated sequence design challenges.
arXiv Detail & Related papers (2020-10-05T16:40:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.