Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model
for Protein Design
- URL: http://arxiv.org/abs/2106.13058v1
- Date: Thu, 24 Jun 2021 14:34:24 GMT
- Title: Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model
for Protein Design
- Authors: Yue Cao and Payel Das and Vijil Chenthamarakshan and Pin-Yu Chen and
Igor Melnyk and Yang Shen
- Abstract summary: We propose Fold2Seq, a novel framework for designing protein sequences conditioned on a specific target fold.
We show improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design.
The unique advantages of fold-based Fold2Seq, in comparison to a structure-based deep model and RosettaDesign, become more evident on three additional real-world challenges.
- Score: 70.27706384570723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing novel protein sequences for a desired 3D topological fold is a
fundamental yet non-trivial task in protein engineering. Challenges exist due
to the complex sequence--fold relationship, as well as the difficulty of
capturing the diversity of the sequences (and therefore structures and functions)
within a fold. To overcome these challenges, we propose Fold2Seq, a novel
transformer-based generative framework for designing protein sequences
conditioned on a specific target fold. To model the complex sequence--structure
relationship, Fold2Seq jointly learns a sequence embedding using a transformer
and a fold embedding from the density of secondary structural elements in 3D
voxels. On test sets with single, high-resolution and complete structure inputs
for individual folds, our experiments demonstrate improved or comparable
performance of Fold2Seq in terms of speed, coverage, and reliability for
sequence design, when compared to existing state-of-the-art methods that
include data-driven deep generative models and physics-based RosettaDesign. The
unique advantages of fold-based Fold2Seq, in comparison to a structure-based
deep model and RosettaDesign, become more evident on three additional
real-world challenges originating from low-quality, incomplete, or ambiguous
input structures. Source code and data are available at
https://github.com/IBM/fold2seq.
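The abstract describes a fold embedding built from the density of secondary structural elements (SSEs) in 3D voxels. As a rough illustration of that idea, the sketch below bins C-alpha coordinates into a voxel grid with one channel per SSE type. The grid size, box size, SSE categories, and function names are assumptions for illustration only, not Fold2Seq's actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of a voxelized SSE-density fold representation,
# loosely inspired by the description in the abstract. All names, the
# grid resolution, and the SSE categories are illustrative assumptions.
import numpy as np

SSE_TYPES = ["helix", "strand", "loop"]  # assumed SSE categories


def fold_voxel_density(coords, sse_labels, grid=16, box=40.0):
    """Bin C-alpha coordinates into a (grid, grid, grid, n_sse) density tensor.

    coords:     (N, 3) array of C-alpha positions in Angstroms
    sse_labels: length-N sequence of SSE type strings
    grid:       voxels per axis; box: cube edge length in Angstroms
    """
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)  # center the structure in the box
    density = np.zeros((grid, grid, grid, len(SSE_TYPES)))
    # Map each position to a voxel index, clipping points outside the box.
    idx = np.clip(((centered + box / 2.0) / box * grid).astype(int), 0, grid - 1)
    for (i, j, k), sse in zip(idx, sse_labels):
        density[i, j, k, SSE_TYPES.index(sse)] += 1.0
    total = density.sum()
    # Normalize so the tensor sums to 1, making it a density over voxels.
    return density / total if total > 0 else density
```

A tensor like this is coarse and rotation-sensitive, but it captures the "fuzzy and scale-free" character of a fold (where SSEs sit in space) rather than exact atomic coordinates, which is the distinction the abstract draws between fold-based and structure-based inputs.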
Related papers
- Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding [0.0]
Inverse folding is a one-to-many problem where several sequences can fold to the same structure.
We present RL-DIF, a categorical diffusion model for inverse folding that is pre-trained on sequence recovery and tuned via reinforcement learning.
Experiments show RL-DIF can achieve a foldable diversity of 29% on CATH 4.2, compared to 23% from models trained on the same dataset.
arXiv Detail & Related papers (2024-10-22T16:50:34Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation.
We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works.
We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
- ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by the existing training schemes of diffusion generative models.
We present a simple fix to this problem by constructing processes that fully exploit the structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z)
- FoldToken: Learning Protein Language via Vector Quantization and Beyond [56.19308144551836]
We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols.
We refer to the learned symbols as FoldTokens, and the sequence of FoldTokens serves as a new protein language.
arXiv Detail & Related papers (2024-02-04T12:18:51Z)
- Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models.
We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state.
Our model consists of a trigonometry-aware encoder that reasons about geometric constraints and interactions from context features.
All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
- Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective [81.56957468529602]
We propose to model deep NRSfM from a sequence-to-sequence translation perspective.
First, we apply a shape-motion predictor to estimate the initial non-rigid shape and camera motion from a single frame.
Then we propose a context modeling module to model camera motions and complex non-rigid shapes.
arXiv Detail & Related papers (2022-04-10T17:13:52Z)
- Benchmarking deep generative models for diverse antibody sequence design [18.515971640245997]
Deep generative models that learn from sequences alone or from sequences and structures jointly have shown impressive performance on this task.
We consider three recently proposed deep generative frameworks for protein design: (AR) the sequence-based autoregressive generative model, (GVP) the precise structure-based graph neural network, and Fold2Seq that leverages a fuzzy and scale-free representation of a three-dimensional fold.
We benchmark these models on the task of computational design of antibody sequences, which demands designing sequences with high diversity for functional implications.
arXiv Detail & Related papers (2021-11-12T16:23:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.