Mathematics-assisted directed evolution and protein engineering
- URL: http://arxiv.org/abs/2306.04658v1
- Date: Tue, 6 Jun 2023 19:27:11 GMT
- Title: Mathematics-assisted directed evolution and protein engineering
- Authors: Yuchi Qiu, Guo-Wei Wei
- Abstract summary: It is experimentally impossible to perform the deep mutational scanning of the entire protein library due to the enormous mutational space.
This has led to the rapid growth of AI-assisted directed evolution (AIDE) or AI-assisted protein engineering (AIPE) as an emerging research field.
We argue that a class of persistent topological Laplacians (PTLs), including persistent Laplacians, persistent path Laplacians, persistent sheaf Laplacians, persistent hypergraph Laplacians, persistent hyperdigraph Laplacians, and evolutionary de Rham-Hodge theory, can effectively overcome the limitations of the current TDA and offer a new generation of more powerful TDA approaches.
- Score: 0.913755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Directed evolution is a molecular biology technique that is transforming
protein engineering by creating proteins with desirable properties and
functions. However, it is experimentally impossible to perform the deep
mutational scanning of the entire protein library due to the enormous
mutational space, which scales as $20^N$, where $N$ is the number of amino
acids. This has led to the rapid growth of AI-assisted directed evolution
(AIDE) or AI-assisted protein engineering (AIPE) as an emerging research field.
Aided by advanced natural language processing (NLP) techniques, including
long short-term memory, autoencoders, and transformers, sequence-based embeddings
have been dominant approaches in AIDE and AIPE. Persistent Laplacians, an
emerging technique in topological data analysis (TDA), have made
structure-based embeddings a superb option in AIDE and AIPE. We argue that a
class of persistent topological Laplacians (PTLs), including persistent
Laplacians, persistent path Laplacians, persistent sheaf Laplacians, persistent
hypergraph Laplacians, persistent hyperdigraph Laplacians, and evolutionary de
Rham-Hodge theory, can effectively overcome the limitations of the current TDA
and offer a new generation of more powerful TDA approaches. In the general
framework of topological deep learning, mathematics-assisted directed evolution
(MADE) has great potential for future protein engineering.
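To make the $20^N$ scaling in the abstract concrete, the following is a minimal sketch that counts the full sequence space and the much smaller sets of single and double mutants around a wild-type sequence. The function names and the example length of 100 residues are illustrative assumptions, not taken from the paper.

```python
import math

AMINO_ACIDS = 20  # the 20 canonical amino acids


def sequence_space_size(n_residues: int) -> int:
    """Number of distinct sequences of a given length: 20^N."""
    return AMINO_ACIDS ** n_residues


def k_point_mutants(n_residues: int, k: int) -> int:
    """Number of variants that differ from the wild type at exactly k positions.

    Choose k of the N positions, then one of the 19 alternative
    amino acids at each chosen position.
    """
    return math.comb(n_residues, k) * (AMINO_ACIDS - 1) ** k


if __name__ == "__main__":
    n = 100  # hypothetical protein length, chosen only for illustration
    exponent = math.log10(sequence_space_size(n))
    print(f"full sequence space: about 10^{exponent:.0f} sequences")
    print(f"single mutants: {k_point_mutants(n, 1)}")
    print(f"double mutants: {k_point_mutants(n, 2)}")
```

Even for a short protein, exhaustive scanning is feasible only for the lowest mutation orders. This is the gap that the model-guided approaches described in the abstract aim to fill.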
Related papers
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer [53.85265022754878]
We propose a lightweight DDG predictor (Light-DDG) for fast mutation screening.
We also release a large-scale dataset containing millions of mutation data points for pre-training Light-DDG.
For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences.
arXiv Detail & Related papers (2025-02-10T09:26:57Z)
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing & generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- Large Language Model is Secretly a Protein Sequence Optimizer [24.55348363931866]
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence.
We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers.
arXiv Detail & Related papers (2025-01-16T03:44:16Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- Efficiently Predicting Mutational Effect on Homologous Proteins by Evolution Encoding [7.067145619709089]
EvolMPNN is an efficient model to learn evolution-aware protein embeddings.
Our model performs up to 6.4% better than state-of-the-art methods and attains a 36X inference speedup.
arXiv Detail & Related papers (2024-02-20T23:06:21Z)
- Evolutionary Algorithms Simulating Molecular Evolution: A New Field Proposal [0.0716879432974126]
Recent advancements in genome sequencing have unveiled a vast diversity of protein families, but compared to the massive search space of all possible amino acid sequences, the set of known functional families is minimal.
One could say nature has a limited protein "vocabulary".
By merging evolutionary algorithms, machine learning (ML), and bioinformatics, we can facilitate the development of completely novel proteins which have never existed before.
arXiv Detail & Related papers (2024-02-01T19:22:02Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an efficient ESM-assisted approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations.
We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models.
By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
arXiv Detail & Related papers (2022-12-20T00:26:23Z)