Directed Evolution of Proteins via Bayesian Optimization in Embedding Space
- URL: http://arxiv.org/abs/2509.04998v1
- Date: Fri, 05 Sep 2025 10:47:49 GMT
- Title: Directed Evolution of Proteins via Bayesian Optimization in Embedding Space
- Authors: Matouš Soldát, Jiří Kléma,
- Abstract summary: We present a novel method for machine-learning-assisted directed evolution of proteins.<n>It combines Bayesian optimization with informative representation of protein variants extracted from a pre-trained protein language model.<n>Our method outperforms the state-of-the-art machine-learning-assisted directed evolution methods with regression objective.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Directed evolution is an iterative laboratory process of designing proteins with improved function by iteratively synthesizing new protein variants and evaluating their desired property with expensive and time-consuming biochemical screening. Machine learning methods can help select informative or promising variants for screening to increase their quality and reduce the amount of necessary screening. In this paper, we present a novel method for machine-learning-assisted directed evolution of proteins which combines Bayesian optimization with informative representation of protein variants extracted from a pre-trained protein language model. We demonstrate that the new representation based on the sequence embeddings significantly improves the performance of Bayesian optimization yielding better results with the same number of conducted screening in total. At the same time, our method outperforms the state-of-the-art machine-learning-assisted directed evolution methods with regression objective.
Related papers
- Boosting In-Silicon Directed Evolution with Fine-Tuned Protein Language Model and Tree Search [67.15159962819979]
We propose AlphaDE, a novel framework to optimize protein sequences by harnessing the innovative paradigms of large language models.<n>First, AlphaDE fine-tunes pretrained protein language models using masked language modeling on protein sequences to activate the evolutionary plausibility for the interested protein class.<n>Second, AlphaDE introduces test-time inference based on Monte Carlo tree search, which effectively evolves proteins with evolutionary guidance from fine-tuned protein language model.
arXiv Detail & Related papers (2025-11-13T03:00:52Z) - Large Language Model is Secretly a Protein Sequence Optimizer [24.55348363931866]
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence.<n>We demonstrate large language models (LLMs), despite being trained on massive texts, are secretly protein sequences.
arXiv Detail & Related papers (2025-01-16T03:44:16Z) - SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z) - Boosting Protein Language Models with Negative Sample Mining [20.721167029530168]
We introduce a pioneering methodology for boosting large language models in the domain of protein representation learning.
Our primary contribution lies in the refinement process for correlating the over-reliance on co-evolution knowledge.
By capitalizing on this novel approach, our technique steers the training of transformer-based models within the attention score space.
arXiv Detail & Related papers (2024-05-28T07:24:20Z) - Enhancing Protein Predictive Models via Proteins Data Augmentation: A
Benchmark and New Directions [58.819567030843025]
This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA)
arXiv Detail & Related papers (2024-03-01T07:58:29Z) - Protein Sequence Design with Batch Bayesian Optimisation [0.0]
Protein sequence design is a challenging problem in protein engineering, which aims to discover novel proteins with useful biological functions.
directed evolution is a widely-used approach for protein sequence design, which mimics the evolution cycle in a laboratory environment and conducts an iterative protocol.
We propose a new method based on Batch Bayesian Optimization (Batch BO), a well-established optimization method, for protein sequence design.
arXiv Detail & Related papers (2023-03-18T14:53:20Z) - Plug & Play Directed Evolution of Proteins with Gradient-based Discrete
MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations.
We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models.
By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
arXiv Detail & Related papers (2022-12-20T00:26:23Z) - Learning Geometrically Disentangled Representations of Protein Folding
Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - Using Genetic Programming to Predict and Optimize Protein Function [65.25258357832584]
We propose POET, a computational Genetic Programming tool based on evolutionary methods to enhance screening and mutagenesis in Directed Evolution.
As a proof-of-concept we use peptides that generate MRI contrast detected by the Chemical Exchange Saturation Transfer mechanism.
Our results indicate that a computational modelling tool like POET can help to find peptides with 400% better functionality than used before.
arXiv Detail & Related papers (2022-02-08T18:08:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.