Diffusion on language model embeddings for protein sequence generation
- URL: http://arxiv.org/abs/2403.03726v1
- Date: Wed, 6 Mar 2024 14:15:20 GMT
- Title: Diffusion on language model embeddings for protein sequence generation
- Authors: Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor
Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov
- Abstract summary: We introduce DiMA, a model that leverages continuous diffusion to generate amino acid sequences.
We quantitatively illustrate the impact of the design choices that lead to its superior performance.
Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space.
- Score: 0.5442686600296733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein design requires a deep understanding of the inherent complexities of
the protein universe. While many efforts lean towards conditional generation or
focus on specific families of proteins, the foundational task of unconditional
generation remains underexplored and undervalued. Here, we explore this pivotal
domain, introducing DiMA, a model that leverages continuous diffusion on
embeddings derived from the protein language model, ESM-2, to generate amino
acid sequences. DiMA surpasses leading solutions, including autoregressive
transformer-based and discrete diffusion models, and we quantitatively
illustrate the impact of the design choices that lead to its superior
performance. We extensively evaluate the quality, diversity, distribution
similarity, and biological relevance of the generated sequences using multiple
metrics across various modalities. Our approach consistently produces novel,
diverse protein sequences that accurately reflect the inherent structural and
functional diversity of the protein space. This work advances the field of
protein design and sets the stage for conditional models by providing a robust
framework for scalable and high-quality protein sequence generation.
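The core mechanism described above, continuous Gaussian diffusion applied to embedding vectors from a protein language model, can be sketched minimally. This is an illustrative toy, not the paper's implementation: the random 320-dim vectors stand in for frozen ESM-2 embeddings, and the linear noise schedule is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumption)
alphas_bar = np.cumprod(1.0 - betas)    # cumulative product \bar{alpha}_t

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alphas_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

# Stand-in "embeddings" for a length-64 protein, 320 dims per residue
# (in the real pipeline these would come from the pretrained ESM-2 encoder).
x0 = rng.standard_normal((64, 320))
eps = rng.standard_normal(x0.shape)
xt = q_sample(x0, t=500, eps=eps)

# A denoising network would be trained to predict eps from (x_t, t) with an
# MSE objective; at sampling time it runs the process in reverse, and the
# denoised embeddings are decoded back to amino acid sequences.
```

Everything model-specific (the denoiser architecture, the decoder back to sequences, the exact schedule) is left out; the point is only the continuous forward process operating on embedding space rather than discrete tokens.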
Related papers
- OneProt: Towards Multi-Modal Protein Foundation Models [5.440531199006399]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data.
It surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction.
This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z)
- Structure Language Models for Protein Conformation Generation [66.42864253026053]
Traditional physics-based simulation methods often struggle with sampling equilibrium conformations.
Deep generative models have shown promise in generating protein conformations as a more efficient alternative.
We introduce Structure Language Modeling as a novel framework for efficient protein conformation generation.
arXiv Detail & Related papers (2024-10-24T03:38:51Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation.
We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works.
We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
- Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding [90.77521413857448]
Deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations.
We introduce Generalized Encoding-Decoding Diffusion Probabilistic Models (EDDPMs).
EDDPMs generalize the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding.
Experiments on text, proteins, and images demonstrate the flexibility to handle diverse data and tasks.
arXiv Detail & Related papers (2024-02-29T10:08:57Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning [2.01243755755303]
We introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets.
This is the first effort to unify different representations of proteins into a single model, which we name the Protein Multimodal Network (PMN).
arXiv Detail & Related papers (2023-08-02T12:08:17Z)
- Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.