Related papers: Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization

Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization

URL: http://arxiv.org/abs/2410.19471v1
Date: Fri, 25 Oct 2024 11:04:02 GMT
Title: Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization
Authors: Ryan Park, Darren J. Hsu, C. Brian Roland, Maria Korshunova, Chen Tessler, Shie Mannor, Olivia Viessmann, Bruno Trentini,
Abstract summary: Inverse folding models predict amino acid sequences that fold into desired reference structures. ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure. But when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure.
Score: 33.131551374836775
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inverse folding models play an important role in structure-based design by predicting amino acid sequences that fold into desired reference structures. Models like ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure. However, when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure. To address this, we fine-tune ProteinMPNN to produce diverse and structurally consistent peptide sequences via Direct Preference Optimization (DPO). We derive two enhancements to DPO: online diversity regularization and domain-specific priors. Additionally, we develop a new understanding on improving diversity in decoder models. When conditioned on OpenFold generated structures, our fine-tuned models achieve state-of-the-art structural similarity scores, improving base ProteinMPNN by at least 8%. Compared to standard DPO, our regularized method achieves up to 20% higher sequence diversity with no loss in structural similarity score.

Related papers

Protein Inverse Folding From Structure Feedback [78.27854221882572]
We introduce a novel approach to fine-tune an inverse folding model using feedback from a protein folding model.<n>Our results on the CATH 4.2 test set demonstrate that DPO fine-tuning leads to a significant improvement in average TM-Score.
arXiv Detail & Related papers (2025-06-03T16:02:12Z)
Improving Protein Sequence Design through Designability Preference Optimization [22.037870784317885]
We redefine the training objective by steering sequence generation toward high designability.<n>We introduce Residue-level Designability Preference Optimization (ResiDPO)<n>This enables direct improvement in designability while preserving regions that already perform well.
arXiv Detail & Related papers (2025-05-30T23:02:51Z)
AffinityFlow: Guided Flows for Antibody Affinity Maturation [6.690846683150576]
Antibodies are widely used as therapeutics, but their development requires affinity maturation to enhance binding affinity. Recently AlphaFlow wraps AlphaFold within flow matching to generate diverse protein structures, enabling a sequence-conditioned generative model of structure. We propose an alternating optimization framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure-based affinity predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence-based affinity predictor for post selection.
arXiv Detail & Related papers (2025-02-14T18:43:22Z)
Fast and Accurate Antibody Sequence Design via Structure Retrieval [32.38989928131971]
Igseek is a novel structure-retrieval framework that infers sequences by similar structures from a natural antibody database. Our experiments demonstrate that Igseek not only proves to be highly efficient in structural retrieval but also outperforms state-of-the-art approaches in sequence recovery for both antibodies and T-Cell Receptors.
arXiv Detail & Related papers (2025-02-11T13:29:49Z)
Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding [0.0]
inverse folding is a one-to-many problem where several sequences can fold to the same structure. We present RL-DIF, a categorical diffusion model for inverse folding that is pre-trained on sequence recovery and tuned via reinforcement learning. Experiments show RL-DIF can achieve an foldable diversity of 29% on CATH 4.2, compared to 23% from models trained on the same dataset.
arXiv Detail & Related papers (2024-10-22T16:50:34Z)
DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation. We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization [49.85944390503957]
DecompOpt is a structure-based molecular optimization method based on a controllable and diffusion model. We show that DecompOpt can efficiently generate molecules with improved properties than strong de novo baselines.
arXiv Detail & Related papers (2024-03-07T02:53:40Z)
Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
AlphaFold Distillation for Protein Design [25.190210443632825]
Inverse protein folding is crucial in bio-engineering and drug discovery. Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences. We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
arXiv Detail & Related papers (2022-10-05T19:43:06Z)
Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria. We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
Benchmarking deep generative models for diverse antibody sequence design [18.515971640245997]
Deep generative models that learn from sequences alone or from sequences and structures jointly have shown impressive performance on this task. We consider three recently proposed deep generative frameworks for protein design: (AR) the sequence-based autoregressive generative model, (GVP) the precise structure-based graph neural network, and Fold2Seq that leverages a fuzzy and scale-free representation of a three-dimensional fold. We benchmark these models on the task of computational design of antibody sequences, which demand designing sequences with high diversity for functional implication.
arXiv Detail & Related papers (2021-11-12T16:23:32Z)
EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network. Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.