Guided Generative Protein Design using Regularized Transformers
- URL: http://arxiv.org/abs/2201.09948v1
- Date: Mon, 24 Jan 2022 20:55:53 GMT
- Title: Guided Generative Protein Design using Regularized Transformers
- Authors: Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin B.
Givechian, Dhananjay Bhaskar, Smita Krishnaswamy
- Abstract summary: We introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which is trained to jointly generate sequences and predict fitness.
We explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods.
- Score: 5.425399390255931
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The development of powerful natural language models has increased the
ability to learn meaningful representations of protein sequences. In addition,
advances in high-throughput mutagenesis, directed evolution, and
next-generation sequencing have allowed for the accumulation of large amounts
of labeled fitness data. Leveraging these two trends, we introduce Regularized
Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which
is trained to jointly generate sequences as well as predict fitness. Using
ReLSO, we explicitly model the underlying sequence-function landscape of large
labeled datasets and optimize within latent space using gradient-based methods.
Through regularized prediction heads, ReLSO introduces a powerful protein
sequence encoder and a novel approach for efficient fitness landscape traversal.
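The optimization loop described here, encoding a seed sequence, ascending the predicted-fitness surface in latent space, then decoding, can be illustrated with a short sketch. The tiny autoencoder and fitness head below are hypothetical stand-ins for illustration, not the paper's actual architecture:

```python
# Minimal sketch of gradient-based latent-space optimization in the spirit of
# ReLSO. The tiny encoder/decoder/fitness head are hypothetical stand-ins.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, LATENT = 21, 50, 32  # 20 amino acids + pad (assumed sizes)

class ToyJointAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 16)
        self.encoder = nn.Linear(SEQ_LEN * 16, LATENT)
        self.decoder = nn.Linear(LATENT, SEQ_LEN * VOCAB)
        # Stand-in for the regularized prediction head: latent -> fitness.
        self.fitness_head = nn.Sequential(
            nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 1))

    def encode(self, tokens):
        return self.encoder(self.embed(tokens).flatten(1))

    def decode_logits(self, z):
        return self.decoder(z).view(-1, SEQ_LEN, VOCAB)

model = ToyJointAutoencoder()  # pretend this was trained jointly

# Encode a seed sequence, then ascend predicted fitness in latent space.
seed = torch.randint(0, VOCAB, (1, SEQ_LEN))
z = model.encode(seed).detach().requires_grad_(True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    (-model.fitness_head(z)).mean().backward()  # maximize predicted fitness
    opt.step()

designed = model.decode_logits(z).argmax(-1)  # decode the optimized latent
print(designed.shape)  # torch.Size([1, 50])
```

Because fitness is predicted from the same latent code the decoder consumes, gradient steps on z move a design toward higher predicted fitness while remaining decodable into a sequence.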
Related papers
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space [13.228932754390748]
We propose LatProtRL, an optimization method to efficiently traverse a latent space learned by an encoder-decoder leveraging a large protein language model.
To escape local optima, our optimization is modeled as a Markov decision process using reinforcement learning acting directly in latent space.
Our findings and in vitro evaluation show that the generated sequences can reach high-fitness regions, suggesting a substantial potential of LatProtRL in lab-in-the-loop scenarios.
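A minimal sketch of the idea, treating each latent-space move as an MDP step whose reward is the change in fitness, might look as follows; the fitness oracle and the random policy are hypothetical placeholders (the paper trains an RL agent on a learned latent space):

```python
# Minimal sketch of latent-space traversal framed as an MDP, in the spirit
# of LatProtRL. Oracle and policy below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
LATENT = 16

def fitness_oracle(z):
    # Assumed toy landscape with local optima.
    return -np.sum((z - 1.0) ** 2) + 0.5 * np.sin(5 * z).sum()

state = rng.normal(size=LATENT)
best_fit = fitness_oracle(state)

for _ in range(500):
    action = rng.normal(scale=0.1, size=LATENT)   # policy: random move
    nxt = state + action
    reward = fitness_oracle(nxt) - fitness_oracle(state)
    # Occasionally accept downhill moves so the chain escapes local optima.
    if reward > 0 or rng.random() < 0.05:
        state = nxt
    best_fit = max(best_fit, fitness_oracle(state))

print(f"best fitness found: {best_fit:.3f}")
```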
arXiv Detail & Related papers (2024-05-29T11:03:42Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
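Unconditional generation under a discrete diffusion framework can be sketched as iterative unmasking: start fully masked and reveal the most confident positions over several rounds. The random stub model below is an assumption standing in for the pre-trained DPLM transformer:

```python
# Minimal sketch of generation by iterative unmasking, in the spirit of a
# discrete diffusion protein language model. The stub stands in for DPLM.
import torch

AA, MASK, SEQ_LEN, STEPS = 20, 20, 64, 8  # 20 amino acids, id 20 = mask

def stub_model_probs(tokens):
    # Placeholder: a real model returns per-position amino-acid
    # distributions conditioned on the partially unmasked sequence.
    return torch.softmax(torch.randn(tokens.shape[0], tokens.shape[1], AA), -1)

seq = torch.full((1, SEQ_LEN), MASK)               # start fully masked
for t in range(STEPS):
    probs = stub_model_probs(seq)
    sampled = torch.distributions.Categorical(probs).sample()
    conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
    # Reveal the most confident positions this round; re-mask the rest.
    k = SEQ_LEN * (t + 1) // STEPS
    keep = conf.topk(k, dim=-1).indices
    seq = torch.full_like(seq, MASK).scatter_(1, keep, sampled.gather(1, keep))

print(seq)  # fully unmasked token sequence after the final round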
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking [36.14826783009814]
Traditional docking methods rely on scoring functions and deep learning to predict docking poses between proteins and drugs.
In this paper, we propose a transformer neural network for protein-ligand docking pose prediction.
The experimental results on real datasets show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-10-12T06:23:12Z)
- Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression [63.56922682378755]
We focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding.
The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform.
Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
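The mechanism, predicting per-position kernel offsets from content and aggregating features at the shifted locations, can be sketched with a deformable-style resampling step. The shapes, the small offset network, and the 0.2 offset scale are assumptions for illustration:

```python
# Minimal sketch of content-conditioned spatial aggregation via predicted
# offsets (deformable-style resampling), in the spirit of dynamic kernels.
import torch
import torch.nn.functional as F

N, C, H, W = 1, 8, 16, 16
feat = torch.randn(N, C, H, W)

offset_net = torch.nn.Conv2d(C, 2, kernel_size=3, padding=1)
offsets = torch.tanh(offset_net(feat))             # bounded offsets, (N,2,H,W)

# Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)

# Shift the grid by content-predicted offsets and resample the features:
# each output position aggregates information from a learned location.
grid = base + 0.2 * offsets.permute(0, 2, 3, 1)
aggregated = F.grid_sample(feat, grid, align_corners=True)
print(aggregated.shape)  # torch.Size([1, 8, 16, 16])
```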
arXiv Detail & Related papers (2023-08-17T01:34:51Z)
- Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems [97.64313409741614]
We prove fast mixing and characterize the stationary distribution of the Langevin Algorithm for inverting randomly weighted DNN generators.
We propose to do posterior sampling in the latent space of a pre-trained generative model.
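A minimal sketch of Langevin posterior sampling in a generator's latent space follows; the toy linear generator, Gaussian likelihood, and step size are assumptions for illustration:

```python
# Minimal sketch of posterior sampling with the unadjusted Langevin
# algorithm in the latent space of a pre-trained generator.
import torch

LATENT, DATA = 8, 32
G = torch.nn.Linear(LATENT, DATA)   # stand-in for a pre-trained generator
y = torch.randn(DATA)               # observed measurement
sigma, step = 0.5, 1e-3

z = torch.zeros(LATENT, requires_grad=True)
for _ in range(2000):
    # log posterior = log p(y | G(z)) + log p(z), up to an additive constant
    log_post = (-((G(z) - y) ** 2).sum() / (2 * sigma ** 2)
                - 0.5 * (z ** 2).sum())
    grad, = torch.autograd.grad(log_post, z)
    with torch.no_grad():
        # Langevin step: drift up the log-posterior gradient, plus noise.
        z += step * grad + (2 * step) ** 0.5 * torch.randn_like(z)

print(z)  # an approximate posterior sample over latents
```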
arXiv Detail & Related papers (2022-06-18T03:47:37Z)
- Generative power of a protein language model trained on multiple sequence alignments [0.5639904484784126]
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
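The iterative procedure, repeatedly masking a fraction of positions and resampling them from the masked-LM distribution, can be sketched as below; the stub predictor is a placeholder for MSA Transformer, which would also condition on the rest of the alignment:

```python
# Minimal sketch of iterative generation from the masked language modeling
# objective, in the spirit of using MSA Transformer generatively.
import torch

VOCAB, SEQ_LEN, ROUNDS, MASK_FRAC = 20, 48, 10, 0.1

def stub_mlm_probs(tokens):
    # Placeholder for the masked LM's output distribution at each position.
    return torch.softmax(torch.randn(tokens.shape[0], VOCAB), -1)

seq = torch.randint(0, VOCAB, (SEQ_LEN,))  # start from an existing sequence
for _ in range(ROUNDS):
    # Mask a random subset of positions, then resample them from the model.
    masked = torch.rand(SEQ_LEN) < MASK_FRAC
    resampled = torch.distributions.Categorical(stub_mlm_probs(seq)).sample()
    seq = torch.where(masked, resampled, seq)

print(seq)  # an iteratively resampled sequence
```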
arXiv Detail & Related papers (2022-04-14T16:59:05Z)
- Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
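Fully-differentiable structure refinement of this kind reduces to gradient descent on a learned energy over coordinates. A minimal sketch with a toy backbone-regularity energy, an assumption standing in for the trained network:

```python
# Minimal sketch of structure refinement by gradient descent on a learned
# energy, in the spirit of EBM-Fold. The toy energy is a placeholder.
import torch

N_RES = 30
coords = torch.randn(N_RES, 3, requires_grad=True)  # C-alpha positions

def toy_energy(x):
    # Assumed placeholder: penalize consecutive C-alpha distances that
    # deviate from ~3.8 Angstroms, a rough backbone regularity.
    d = (x[1:] - x[:-1]).norm(dim=-1)
    return ((d - 3.8) ** 2).sum()

opt = torch.optim.Adam([coords], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    toy_energy(coords).backward()  # push coordinates downhill in energy
    opt.step()

print(f"final energy: {toy_energy(coords).item():.4f}")
```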
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
- AdaLead: A simple and robust adaptive greedy search algorithm for sequence design [55.41644538483948]
We develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead).
AdaLead is a remarkably strong benchmark that out-competes more complex state-of-the-art approaches in a variety of biologically motivated sequence design challenges.
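The greedy loop can be sketched as follows: expand only parents whose fitness is within an adaptive threshold of the current best, mutate them, and select the top of the combined pool. The random-projection fitness function and the additive threshold are assumptions for illustration; AdaLead's actual thresholding and recombination details differ:

```python
# Minimal sketch of an adaptive greedy evolutionary loop in the spirit of
# AdaLead. The fitness function is a hypothetical stand-in for an oracle.
import numpy as np

rng = np.random.default_rng(0)
ALPHABET, SEQ_LEN, POP, KAPPA = 20, 30, 16, 0.5

W = rng.normal(size=(SEQ_LEN,))
def fitness(seq):
    return float(W @ (seq / ALPHABET))   # placeholder landscape

pop = [rng.integers(0, ALPHABET, SEQ_LEN) for _ in range(POP)]
for generation in range(20):
    fits = [fitness(s) for s in pop]
    best = max(fits)
    # Adaptive threshold: only near-optimal parents are expanded.
    parents = [s for s, f in zip(pop, fits) if f >= best - KAPPA]
    children = []
    for p in parents:
        child = p.copy()
        child[rng.integers(SEQ_LEN)] = rng.integers(ALPHABET)  # point mutation
        children.append(child)
    # Greedy selection over the combined pool.
    pool = pop + children
    pool.sort(key=fitness, reverse=True)
    pop = pool[:POP]

print(f"best fitness after search: {max(fitness(s) for s in pop):.3f}")
```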
arXiv Detail & Related papers (2020-10-05T16:40:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.