Guided Generative Protein Design using Regularized Transformers
- URL: http://arxiv.org/abs/2201.09948v1
- Date: Mon, 24 Jan 2022 20:55:53 GMT
- Title: Guided Generative Protein Design using Regularized Transformers
- Authors: Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin B.
Givechian, Dhananjay Bhaskar, Smita Krishnaswamy
- Abstract summary: We introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which is trained to jointly generate sequences and predict fitness.
We explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods.
- Score: 5.425399390255931
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The development of powerful natural language models has increased the
ability to learn meaningful representations of protein sequences. In addition,
advances in high-throughput mutagenesis, directed evolution, and
next-generation sequencing have allowed for the accumulation of large amounts
of labeled fitness data. Leveraging these two trends, we introduce Regularized
Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which
is trained to jointly generate sequences as well as predict fitness. Using
ReLSO, we explicitly model the underlying sequence-function landscape of large
labeled datasets and optimize within latent space using gradient-based methods.
Through regularized prediction heads, ReLSO introduces a powerful protein
sequence encoder and a novel approach for efficient fitness landscape traversal.
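The optimization loop described here, encoding a seed sequence, ascending the predicted-fitness surface in latent space, then decoding, can be illustrated with a short sketch. The tiny autoencoder and fitness head below are hypothetical stand-ins for illustration, not the paper's actual architecture:

```python
# Minimal sketch of gradient-based latent-space optimization in the spirit of
# ReLSO. The tiny encoder/decoder/fitness head are hypothetical stand-ins.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, LATENT = 21, 50, 32  # 20 amino acids + pad (assumed sizes)

class ToyJointAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 16)
        self.encoder = nn.Linear(SEQ_LEN * 16, LATENT)
        self.decoder = nn.Linear(LATENT, SEQ_LEN * VOCAB)
        # Stand-in for the regularized prediction head: latent -> fitness.
        self.fitness_head = nn.Sequential(
            nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 1))

    def encode(self, tokens):
        return self.encoder(self.embed(tokens).flatten(1))

    def decode_logits(self, z):
        return self.decoder(z).view(-1, SEQ_LEN, VOCAB)

model = ToyJointAutoencoder()  # pretend this was trained jointly

# Encode a seed sequence, then ascend predicted fitness in latent space.
seed = torch.randint(0, VOCAB, (1, SEQ_LEN))
z = model.encode(seed).detach().requires_grad_(True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    (-model.fitness_head(z)).mean().backward()  # maximize predicted fitness
    opt.step()

designed = model.decode_logits(z).argmax(-1)  # decode the optimized latent
print(designed.shape)  # torch.Size([1, 50])
```

Because fitness is predicted from the same latent code the decoder consumes, gradient steps on z move a design toward higher predicted fitness while remaining decodable into a sequence.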
Related papers
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space [13.228932754390748]
We propose LatProtRL, an optimization method to efficiently traverse a latent space learned by an encoder-decoder leveraging a large protein language model.
To escape local optima, our optimization is modeled as a Markov decision process using reinforcement learning acting directly in latent space.
Our findings and in vitro evaluation show that the generated sequences can reach high-fitness regions, suggesting a substantial potential of LatProtRL in lab-in-the-loop scenarios.
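A minimal sketch of the idea, treating each latent-space move as an MDP step whose reward is the change in fitness, might look as follows; the fitness oracle and the random policy are hypothetical placeholders (the paper trains an RL agent on a learned latent space):

```python
# Minimal sketch of latent-space traversal framed as an MDP, in the spirit
# of LatProtRL. Oracle and policy below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
LATENT = 16

def fitness_oracle(z):
    # Assumed toy landscape with local optima.
    return -np.sum((z - 1.0) ** 2) + 0.5 * np.sin(5 * z).sum()

state = rng.normal(size=LATENT)
best_fit = fitness_oracle(state)

for _ in range(500):
    action = rng.normal(scale=0.1, size=LATENT)   # policy: random move
    nxt = state + action
    reward = fitness_oracle(nxt) - fitness_oracle(state)
    # Occasionally accept downhill moves so the chain escapes local optima.
    if reward > 0 or rng.random() < 0.05:
        state = nxt
    best_fit = max(best_fit, fitness_oracle(state))

print(f"best fitness found: {best_fit:.3f}")
```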
arXiv Detail & Related papers (2024-05-29T11:03:42Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
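Unconditional generation under a discrete diffusion framework can be sketched as iterative unmasking: start fully masked and reveal the most confident positions over several rounds. The random stub model below is an assumption standing in for the pre-trained DPLM transformer:

```python
# Minimal sketch of generation by iterative unmasking, in the spirit of a
# discrete diffusion protein language model. The stub stands in for DPLM.
import torch

AA, MASK, SEQ_LEN, STEPS = 20, 20, 64, 8  # 20 amino acids, id 20 = mask

def stub_model_probs(tokens):
    # Placeholder: a real model returns per-position amino-acid
    # distributions conditioned on the partially unmasked sequence.
    return torch.softmax(torch.randn(tokens.shape[0], tokens.shape[1], AA), -1)

seq = torch.full((1, SEQ_LEN), MASK)               # start fully masked
for t in range(STEPS):
    probs = stub_model_probs(seq)
    sampled = torch.distributions.Categorical(probs).sample()
    conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
    # Reveal the most confident positions this round; re-mask the rest.
    k = SEQ_LEN * (t + 1) // STEPS
    keep = conf.topk(k, dim=-1).indices
    seq = torch.full_like(seq, MASK).scatter_(1, keep, sampled.gather(1, keep))

print(seq)  # fully unmasked token sequence after the final round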
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking [36.14826783009814]
Traditional docking methods rely on scoring functions and deep learning to predict docking poses between proteins and drugs.
In this paper, we propose a transformer neural network for protein-ligand docking pose prediction.
The experimental results on real datasets show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-10-12T06:23:12Z)
- Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression [63.56922682378755]
We focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding.
The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform.
Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
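The mechanism, predicting per-position kernel offsets from content and aggregating features at the shifted locations, can be sketched with a deformable-style resampling step. The shapes, the small offset network, and the 0.2 offset scale are assumptions for illustration:

```python
# Minimal sketch of content-conditioned spatial aggregation via predicted
# offsets (deformable-style resampling), in the spirit of dynamic kernels.
import torch
import torch.nn.functional as F

N, C, H, W = 1, 8, 16, 16
feat = torch.randn(N, C, H, W)

offset_net = torch.nn.Conv2d(C, 2, kernel_size=3, padding=1)
offsets = torch.tanh(offset_net(feat))             # bounded offsets, (N,2,H,W)

# Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)

# Shift the grid by content-predicted offsets and resample the features:
# each output position aggregates information from a learned location.
grid = base + 0.2 * offsets.permute(0, 2, 3, 1)
aggregated = F.grid_sample(feat, grid, align_corners=True)
print(aggregated.shape)  # torch.Size([1, 8, 16, 16])
```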
arXiv Detail & Related papers (2023-08-17T01:34:51Z)
- Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems [97.64313409741614]
We prove fast mixing and characterize the stationary distribution of the Langevin Algorithm for inverting randomly weighted DNN generators.
We propose to do posterior sampling in the latent space of a pre-trained generative model.
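A minimal sketch of Langevin posterior sampling in a generator's latent space follows; the toy linear generator, Gaussian likelihood, and step size are assumptions for illustration:

```python
# Minimal sketch of posterior sampling with the unadjusted Langevin
# algorithm in the latent space of a pre-trained generator.
import torch

LATENT, DATA = 8, 32
G = torch.nn.Linear(LATENT, DATA)   # stand-in for a pre-trained generator
y = torch.randn(DATA)               # observed measurement
sigma, step = 0.5, 1e-3

z = torch.zeros(LATENT, requires_grad=True)
for _ in range(2000):
    # log posterior = log p(y | G(z)) + log p(z), up to an additive constant
    log_post = (-((G(z) - y) ** 2).sum() / (2 * sigma ** 2)
                - 0.5 * (z ** 2).sum())
    grad, = torch.autograd.grad(log_post, z)
    with torch.no_grad():
        # Langevin step: drift up the log-posterior gradient, plus noise.
        z += step * grad + (2 * step) ** 0.5 * torch.randn_like(z)

print(z)  # an approximate posterior sample over latents
```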
arXiv Detail & Related papers (2022-06-18T03:47:37Z)
- Generative power of a protein language model trained on multiple sequence alignments [0.5639904484784126]
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
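The iterative procedure, repeatedly masking a fraction of positions and resampling them from the masked-LM distribution, can be sketched as below; the stub predictor is a placeholder for MSA Transformer, which would also condition on the rest of the alignment:

```python
# Minimal sketch of iterative generation from the masked language modeling
# objective, in the spirit of using MSA Transformer generatively.
import torch

VOCAB, SEQ_LEN, ROUNDS, MASK_FRAC = 20, 48, 10, 0.1

def stub_mlm_probs(tokens):
    # Placeholder for the masked LM's output distribution at each position.
    return torch.softmax(torch.randn(tokens.shape[0], VOCAB), -1)

seq = torch.randint(0, VOCAB, (SEQ_LEN,))  # start from an existing sequence
for _ in range(ROUNDS):
    # Mask a random subset of positions, then resample them from the model.
    masked = torch.rand(SEQ_LEN) < MASK_FRAC
    resampled = torch.distributions.Categorical(stub_mlm_probs(seq)).sample()
    seq = torch.where(masked, resampled, seq)

print(seq)  # an iteratively resampled sequence
```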
arXiv Detail & Related papers (2022-04-14T16:59:05Z)
- Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
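Fully-differentiable structure refinement of this kind reduces to gradient descent on a learned energy over coordinates. A minimal sketch with a toy backbone-regularity energy, an assumption standing in for the trained network:

```python
# Minimal sketch of structure refinement by gradient descent on a learned
# energy, in the spirit of EBM-Fold. The toy energy is a placeholder.
import torch

N_RES = 30
coords = torch.randn(N_RES, 3, requires_grad=True)  # C-alpha positions

def toy_energy(x):
    # Assumed placeholder: penalize consecutive C-alpha distances that
    # deviate from ~3.8 Angstroms, a rough backbone regularity.
    d = (x[1:] - x[:-1]).norm(dim=-1)
    return ((d - 3.8) ** 2).sum()

opt = torch.optim.Adam([coords], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    toy_energy(coords).backward()  # push coordinates downhill in energy
    opt.step()

print(f"final energy: {toy_energy(coords).item():.4f}")
```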
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
- AdaLead: A simple and robust adaptive greedy search algorithm for sequence design [55.41644538483948]
We develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead).
AdaLead is a remarkably strong benchmark that out-competes more complex state-of-the-art approaches in a variety of biologically motivated sequence design challenges.
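The greedy loop can be sketched as follows: expand only parents whose fitness is within an adaptive threshold of the current best, mutate them, and select the top of the combined pool. The random-projection fitness function and the additive threshold are assumptions for illustration; AdaLead's actual thresholding and recombination details differ:

```python
# Minimal sketch of an adaptive greedy evolutionary loop in the spirit of
# AdaLead. The fitness function is a hypothetical stand-in for an oracle.
import numpy as np

rng = np.random.default_rng(0)
ALPHABET, SEQ_LEN, POP, KAPPA = 20, 30, 16, 0.5

W = rng.normal(size=(SEQ_LEN,))
def fitness(seq):
    return float(W @ (seq / ALPHABET))   # placeholder landscape

pop = [rng.integers(0, ALPHABET, SEQ_LEN) for _ in range(POP)]
for generation in range(20):
    fits = [fitness(s) for s in pop]
    best = max(fits)
    # Adaptive threshold: only near-optimal parents are expanded.
    parents = [s for s, f in zip(pop, fits) if f >= best - KAPPA]
    children = []
    for p in parents:
        child = p.copy()
        child[rng.integers(SEQ_LEN)] = rng.integers(ALPHABET)  # point mutation
        children.append(child)
    # Greedy selection over the combined pool.
    pool = pop + children
    pool.sort(key=fitness, reverse=True)
    pop = pool[:POP]

print(f"best fitness after search: {max(fitness(s) for s in pop):.3f}")
```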
arXiv Detail & Related papers (2020-10-05T16:40:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.