Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding
- URL: http://arxiv.org/abs/2410.17173v1
- Date: Tue, 22 Oct 2024 16:50:34 GMT
- Title: Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding
- Authors: Yasha Ektefaie, Olivia Viessmann, Siddharth Narayanan, Drew Dresser, J. Mark Kim, Armen Mkrtchyan, et al.
- Abstract summary: Inverse folding is a one-to-many problem where several sequences can fold to the same structure.
We present RL-DIF, a categorical diffusion model for inverse folding that is pre-trained on sequence recovery and tuned via reinforcement learning.
Experiments show RL-DIF can achieve a foldable diversity of 29% on CATH 4.2, compared to 23% from models trained on the same dataset.
- Abstract: Protein inverse folding, that is, predicting an amino acid sequence that will fold into a desired 3D structure, is an important problem for structure-based protein design. Machine learning based methods for inverse folding typically use recovery of the original sequence as the optimization objective. However, inverse folding is a one-to-many problem where several sequences can fold to the same structure. Moreover, for many practical applications it is often desirable to have multiple, diverse sequences that fold into the target structure, since this allows more candidate sequences for downstream optimization. Here, we demonstrate that although recent inverse folding methods show increased sequence recovery, their "foldable diversity", i.e. their ability to generate multiple non-similar sequences that fold into structures consistent with the target, does not increase. To address this, we present RL-DIF, a categorical diffusion model for inverse folding that is pre-trained on sequence recovery and tuned via reinforcement learning on structural consistency. We find that RL-DIF achieves sequence recovery and structural consistency comparable to benchmark models but shows greater foldable diversity: experiments show RL-DIF can achieve a foldable diversity of 29% on CATH 4.2, compared to 23% from models trained on the same dataset. The PyTorch model weights and sampling code are available on GitHub.
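To make the "foldable diversity" metric concrete, here is a minimal sketch of one plausible reading of it: sample several sequences for a target backbone, keep only those that refold to a structure consistent with the target, and measure how dissimilar the surviving designs are. The `fold_and_score` helper (a forward-folding model plus a TM-score comparison) and the identity-based diversity measure are illustrative assumptions, not the paper's exact definition.

```python
# Illustrative sketch of a "foldable diversity"-style metric.
# fold_and_score(seq) is an assumed helper that refolds a designed sequence
# and returns its structural consistency (e.g. TM-score) with the target.
from itertools import combinations

def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    assert len(a) == len(b), "designs for one backbone share a length"
    return sum(x == y for x, y in zip(a, b)) / len(a)

def foldable_diversity(samples, fold_and_score, tm_cutoff=0.7):
    """samples: designed sequences for one target structure."""
    # Keep only designs whose refolded structure matches the target.
    foldable = [s for s in samples if fold_and_score(s) >= tm_cutoff]
    if len(foldable) < 2:
        return 0.0  # fewer than two foldable designs: no measurable diversity
    # Diversity: mean pairwise dissimilarity among the foldable designs.
    pairs = list(combinations(foldable, 2))
    return sum(1.0 - sequence_identity(a, b) for a, b in pairs) / len(pairs)
```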
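The second ingredient, RL tuning on structural consistency, could look roughly like the following REINFORCE-style sketch. `model.sample`, `model.log_prob`, and `reward_fn` (e.g. TM-score of the refolded design against the target backbone) are hypothetical interfaces; the paper's actual objective and gradient estimator for the categorical diffusion model may differ.

```python
# REINFORCE-style sketch of RL tuning on structural consistency.
# model.sample / model.log_prob and reward_fn are assumed interfaces.
import torch

def rl_finetune_step(model, structure, reward_fn, optimizer, n_samples=8):
    # Sample candidate sequences conditioned on the target backbone.
    seqs = [model.sample(structure) for _ in range(n_samples)]
    # Reward each design by its structural consistency with the target.
    rewards = torch.tensor([reward_fn(s, structure) for s in seqs])
    # Subtracting the batch mean as a baseline reduces gradient variance.
    advantages = rewards - rewards.mean()
    log_probs = torch.stack([model.log_prob(s, structure) for s in seqs])
    # Policy-gradient loss: raise the likelihood of high-reward designs.
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```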
Related papers
- Bridge-IF: Learning Inverse Protein Folding with Markov Bridges [32.53142367116483]
Inverse protein folding is a fundamental task in computational protein design.
We propose Bridge-IF, a generative diffusion bridge model for inverse folding.
We show that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability.
arXiv Detail & Related papers (2024-11-04T14:35:14Z)
- Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization [33.131551374836775]
Inverse folding models predict amino acid sequences that fold into desired reference structures.
Models such as ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure.
But when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure.
arXiv Detail & Related papers (2024-10-25T11:04:02Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation.
We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works.
We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
- Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose a lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- AlphaFold Distillation for Protein Design [25.190210443632825]
Inverse protein folding is crucial in bio-engineering and drug discovery.
Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences.
We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
arXiv Detail & Related papers (2022-10-05T19:43:06Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
- Benchmarking deep generative models for diverse antibody sequence design [18.515971640245997]
Deep generative models that learn from sequences alone, or from sequences and structures jointly, have shown impressive performance on protein design.
We consider three recently proposed deep generative frameworks: (AR) a sequence-based autoregressive generative model, (GVP) a structure-based graph neural network, and (Fold2Seq) a model that leverages a fuzzy, scale-free representation of a three-dimensional fold.
We benchmark these models on the task of computational design of antibody sequences, which demands designing sequences with high diversity for functional implications.
arXiv Detail & Related papers (2021-11-12T16:23:32Z)
- Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design [70.27706384570723]
We propose Fold2Seq, a novel framework for designing protein sequences conditioned on a specific target fold.
We show improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design.
The unique advantages of fold-based Fold2Seq, in comparison to a structure-based deep model and RosettaDesign, become more evident on three additional real-world challenges.
arXiv Detail & Related papers (2021-06-24T14:34:24Z)