Related papers: Mathematics-assisted directed evolution and protein engineering

URL: http://arxiv.org/abs/2306.04658v1
Date: Tue, 6 Jun 2023 19:27:11 GMT
Title: Mathematics-assisted directed evolution and protein engineering
Authors: Yuchi Qiu, Guo-Wei Wei
Abstract summary: It is experimentally impossible to perform the deep mutational scanning of the entire protein library due to the enormous mutational space. This has led to the rapid growth of AI-assisted directed evolution (AIDE) or AI-assisted protein engineering (AIPE) as an emerging research field. We argue that a class of persistent topological Laplacians (PTLs), including persistent Laplacians, persistent path Laplacians, persistent sheaf Laplacians, persistent hypergraph Laplacians, persistent hyperdigraph Laplacians, and evolutionary de Rham-Hodge theory, can effectively overcome the limitations
Score: 0.913755431537592
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Directed evolution is a molecular biology technique that is transforming protein engineering by creating proteins with desirable properties and functions. However, it is experimentally impossible to perform deep mutational scanning of the entire protein library due to the enormous mutational space, which scales as $20^N$, where $N$ is the number of amino acids. This has led to the rapid growth of AI-assisted directed evolution (AIDE) or AI-assisted protein engineering (AIPE) as an emerging research field. Aided by advanced natural language processing (NLP) techniques, including long short-term memory, autoencoder, and transformer, sequence-based embeddings have been the dominant approaches in AIDE and AIPE. Persistent Laplacians, an emerging technique in topological data analysis (TDA), have made structure-based embeddings a superb option in AIDE and AIPE. We argue that a class of persistent topological Laplacians (PTLs), including persistent Laplacians, persistent path Laplacians, persistent sheaf Laplacians, persistent hypergraph Laplacians, persistent hyperdigraph Laplacians, and evolutionary de Rham-Hodge theory, can effectively overcome the limitations of the current TDA and offer a new generation of more powerful TDA approaches. In the general framework of topological deep learning, mathematics-assisted directed evolution (MADE) has great potential for future protein engineering.
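Illustration: the $20^N$ scaling mentioned in the abstract can be made concrete with a few lines of arithmetic. The sketch below is not from the paper; the function names and the example lengths are hypothetical and chosen only for illustration. It assumes the 20 canonical amino acids and counts every possible sequence of length N.

```python
# Minimal sketch of the 20^N mutational-space scaling (illustrative only;
# names below are hypothetical and not taken from the paper).

AMINO_ACIDS = 20  # assumption: the 20 canonical amino acids


def library_size(n_residues: int) -> int:
    """Number of distinct sequences of length n_residues (20^N)."""
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return AMINO_ACIDS ** n_residues


def order_of_magnitude(n_residues: int) -> int:
    """Exponent k such that 10^k <= 20^N < 10^(k+1), computed exactly."""
    return len(str(library_size(n_residues))) - 1


if __name__ == "__main__":
    # Example lengths are arbitrary; real proteins are often hundreds of residues long.
    for n in (1, 5, 10, 100):
        print(f"N={n:>3}: about 10^{order_of_magnitude(n)} possible sequences")
```

Python integers have arbitrary precision, so the count is exact even for long sequences. Already at N = 100 the library has on the order of 10^130 members, which is why exhaustive experimental screening is out of reach.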

Related papers

Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation [3.2039076408339353]
We introduce Prot42, a novel family of Protein Language Models (pLMs) pretrained on vast amounts of unlabeled protein sequences. Remarkably, our models handle sequences up to 8,192 amino acids, significantly surpassing standard limitations. Prot42 excels in generating high-affinity protein binders and sequence-specific DNA-binding proteins.
arXiv Detail & Related papers (2025-04-06T11:43:12Z)
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence [54.14779179869007]
We highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for data generation, sharing, and analysis.
arXiv Detail & Related papers (2025-02-21T13:20:33Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer [53.85265022754878]
We propose a lightweight DDG predictor (Light-DDG) for fast mutation screening. We also release a large-scale dataset containing millions of mutation data for pre-training Light-DDG. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences.
arXiv Detail & Related papers (2025-02-10T09:26:57Z)
Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
Large Language Model is Secretly a Protein Sequence Optimizer [24.55348363931866]
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers.
arXiv Detail & Related papers (2025-01-16T03:44:16Z)
MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z)
NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks. Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
Efficiently Predicting Mutational Effect on Homologous Proteins by Evolution Encoding [7.067145619709089]
EvolMPNN is an efficient model to learn evolution-aware protein embeddings. Our model performs up to 6.4% better than state-of-the-art methods and attains a 36X inference speedup.
arXiv Detail & Related papers (2024-02-20T23:06:21Z)
Evolutionary Algorithms Simulating Molecular Evolution: A New Field Proposal [0.0716879432974126]
Recent advancements in genome sequencing have unveiled a vast diversity of protein families, but compared to the massive search space of all possible amino acid sequences, the set of known functional families is minimal. One could say nature has a limited protein "vocabulary." By merging evolutionary algorithms, machine learning (ML), and bioinformatics, we can facilitate the development of completely novel proteins which have never existed before.
arXiv Detail & Related papers (2024-02-01T19:22:02Z)
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously. xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry. We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling. We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
A Survey on Protein Representation Learning: Retrospect and Prospect [42.38007308086495]
Protein representation learning is a promising research topic for extracting informative knowledge from massive protein sequences or structures. We introduce the motivations for protein representation learning and formulate it in a general and unified framework. Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling.
arXiv Detail & Related papers (2022-12-31T04:01:16Z)
Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
arXiv Detail & Related papers (2022-12-20T00:26:23Z)
Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein. Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution [18.726398852721204]
We propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution. ODBO employs a combination of novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest.
arXiv Detail & Related papers (2022-05-19T13:21:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.