Related papers: Diffusion on language model embeddings for protein sequence generation

Diffusion on language model embeddings for protein sequence generation

URL: http://arxiv.org/abs/2403.03726v1
Date: Wed, 6 Mar 2024 14:15:20 GMT
Title: Diffusion on language model embeddings for protein sequence generation
Authors: Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov
Abstract summary: We introduce DiMA, a model that leverages continuous diffusion to generate amino acid sequences. We quantitatively illustrate the impact of the design choices that lead to its superior performance. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space.
Score: 0.5442686600296733
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived from the protein language model, ESM-2, to generate amino acid sequences. DiMA surpasses leading solutions, including autoregressive transformer-based and discrete diffusion models, and we quantitatively illustrate the impact of the design choices that lead to its superior performance. We extensively evaluate the quality, diversity, distribution similarity, and biological relevance of the generated sequences using multiple metrics across various modalities. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. This work advances the field of protein design and sets the stage for conditional models by providing a robust framework for scalable and high-quality protein sequence generation.

Related papers

Demystify Protein Generation with Hierarchical Conditional Diffusion Models [17.174551222714722]
We propose a novel conditional diffusion model for efficient end-to-end protein design guided by specified functions.<n>By generating representations at different levels simultaneously, our framework can effectively model the inherent hierarchical relations between different levels.<n>We also propose a Protein-MMD, a new reliable evaluation metric, to evaluate the quality of generated protein.
arXiv Detail & Related papers (2025-07-24T17:34:02Z)
Bidirectional Hierarchical Protein Multi-Modal Representation Learning [4.682021474006426]
Protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features.
arXiv Detail & Related papers (2025-04-07T06:47:49Z)
OneProt: Towards Multi-Modal Protein Foundation Models [5.440531199006399]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data. It surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction. This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z)
SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models. It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
Structure Language Models for Protein Conformation Generation [66.42864253026053]
Traditional physics-based simulation methods often struggle with sampling equilibrium conformations. Deep generative models have shown promise in generating protein conformations as a more efficient alternative. We introduce Structure Language Modeling as a novel framework for efficient protein conformation generation.
arXiv Detail & Related papers (2024-10-24T03:38:51Z)
DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation. We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
Protein Conformation Generation via Force-Guided SE(3) Diffusion Models [48.48934625235448]
Deep generative modeling techniques have been employed to generate novel protein conformations. We propose a force-guided SE(3) diffusion model, ConfDiff, for protein conformation generation.
arXiv Detail & Related papers (2024-03-21T02:44:08Z)
Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding [90.77521413857448]
Deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations. We introduce Generalized generative adversarial-Decoding Diffusion Probabilistic Models (EDDPMs) EDDPMs generalize the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding. Experiments on text, proteins, and images demonstrate the flexibility to handle diverse data and tasks.
arXiv Detail & Related papers (2024-02-29T10:08:57Z)
Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling. We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models. We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state. Our model consists of a trigonometry-aware encoder that reasons geometrical constraints and interactions from context features. All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein. Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
Few Shot Protein Generation [4.7210697296108926]
We present the MSA-to-protein transformer, a generative model of protein sequences conditioned on protein families represented by multiple sequence alignments (MSAs) Unlike existing approaches to learning generative models of protein families, the MSA-to-protein transformer conditions sequence generation directly on a learned encoding of the multiple sequence alignment. Our generative approach accurately models epistasis and indels and allows for exact inference and efficient sampling unlike other approaches.
arXiv Detail & Related papers (2022-04-03T22:14:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.