AlphaFold Distillation for Protein Design
- URL: http://arxiv.org/abs/2210.03488v2
- Date: Wed, 22 Nov 2023 22:52:47 GMT
- Title: AlphaFold Distillation for Protein Design
- Authors: Igor Melnyk, Aurelie Lozano, Payel Das, Vijil Chenthamarakshan
- Abstract summary: Inverse protein folding is crucial in bio-engineering and drug discovery.
Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences.
We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
- Score: 25.190210443632825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inverse protein folding, the process of designing sequences that fold into a
specific 3D structure, is crucial in bio-engineering and drug discovery.
Traditional methods rely on experimentally resolved structures, but these cover
only a small fraction of protein sequences. Forward folding models like
AlphaFold offer a potential solution by accurately predicting structures from
sequences. However, these models are too slow for integration into the
optimization loop of inverse folding models during training. To address this,
we propose using knowledge distillation on folding model confidence metrics,
such as pTM or pLDDT scores, to create a faster and end-to-end differentiable
distilled model. This model can then be used as a structure consistency
regularizer in training the inverse folding model. Our technique is versatile
and can be applied to other design tasks, such as sequence-based protein
infilling. Experimental results show that our method outperforms
non-regularized baselines, yielding up to 3% improvement in sequence recovery
and up to 45% improvement in protein diversity while maintaining structural
consistency in generated sequences. Code is available at
https://github.com/IBM/AFDistill
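To make the regularization idea concrete, here is a minimal PyTorch sketch. It is not the authors' implementation (that lives in the repository above), and all module and variable names are hypothetical. It shows a small distilled network predicting a pLDDT-like confidence score from soft amino-acid distributions, with that prediction added as a differentiable structure-consistency term to a standard inverse-folding cross-entropy loss.
```python
# Minimal sketch of the structure-consistency regularizer idea; all names
# are hypothetical and the distilled head here is a toy stand-in.
import torch
import torch.nn as nn

NUM_AA = 20  # standard amino-acid alphabet

class DistilledConfidenceHead(nn.Module):
    """Tiny stand-in for the distilled folding-confidence model.

    The real model is trained to regress AlphaFold confidence metrics
    (pTM/pLDDT) from sequences; here it consumes soft amino-acid
    distributions so gradients can flow back into the sequence designer.
    """

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(NUM_AA, d_model)  # soft one-hot -> embedding
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, aa_probs: torch.Tensor) -> torch.Tensor:
        # aa_probs: (batch, length, NUM_AA), rows summing to 1
        h, _ = self.encoder(self.embed(aa_probs))
        # mean-pool over residues; squash to [0, 1] like a pLDDT fraction
        return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)

def regularized_loss(seq_logits, target_seq, confidence_model, lam=0.1):
    """Inverse-folding cross-entropy plus structure-consistency term."""
    ce = nn.functional.cross_entropy(seq_logits.transpose(1, 2), target_seq)
    aa_probs = seq_logits.softmax(dim=-1)    # differentiable relaxation
    confidence = confidence_model(aa_probs)  # predicted pLDDT-like score
    return ce - lam * confidence.mean()      # reward confident (foldable) designs

# Toy usage: random "designer" logits for a batch of length-50 proteins.
torch.manual_seed(0)
logits = torch.randn(4, 50, NUM_AA, requires_grad=True)
targets = torch.randint(0, NUM_AA, (4, 50))
loss = regularized_loss(logits, targets, DistilledConfidenceHead())
loss.backward()  # gradients reach the designer through the distilled model
```
Because the distilled head is cheap and fully differentiable, it can sit inside the training loop where querying AlphaFold directly would be prohibitively slow; a companion sketch of the distillation step itself appears after the related-papers list below.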
Related papers
- Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization [33.131551374836775]
Inverse folding models predict amino acid sequences that fold into desired reference structures.
ProteinMPNN, a message-passing encoder-decoder model, is trained to reliably produce new sequences from a reference structure.
However, when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure.
arXiv Detail & Related papers (2024-10-25T11:04:02Z)
- Structure Language Models for Protein Conformation Generation [66.42864253026053]
Traditional physics-based simulation methods often struggle with sampling equilibrium conformations.
Deep generative models have shown promise in generating protein conformations as a more efficient alternative.
We introduce Structure Language Modeling as a novel framework for efficient protein conformation generation.
arXiv Detail & Related papers (2024-10-24T03:38:51Z)
- Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding [0.0]
Inverse folding is a one-to-many problem where several sequences can fold into the same structure.
We present RL-DIF, a categorical diffusion model for inverse folding that is pre-trained on sequence recovery and tuned via reinforcement learning.
Experiments show RL-DIF achieves a foldable diversity of 29% on CATH 4.2, compared to 23% from models trained on the same dataset.
arXiv Detail & Related papers (2024-10-22T16:50:34Z)
- AlphaFold Meets Flow Matching for Generating Protein Ensembles [11.1639408863378]
We develop a flow-based generative modeling approach for learning and sampling the conformational landscapes of proteins.
Our method provides a superior combination of precision and diversity compared to AlphaFold with MSA subsampling.
Our method can diversify a static PDB structure with faster wall-clock convergence to certain equilibrium properties than replicate MD trajectories.
arXiv Detail & Related papers (2024-02-07T13:44:47Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Predicting protein variants with equivariant graph neural networks [0.0]
We compare the abilities of equivariant graph neural networks (EGNNs) and sequence-based approaches to identify promising amino-acid mutations.
Our proposed structural approach achieves performance competitive with sequence-based approaches while being trained on significantly fewer molecules.
arXiv Detail & Related papers (2023-06-21T12:44:52Z)
- Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with poor MSAs.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys compared with traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
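As a companion to the regularizer sketch after the abstract, the following toy loop illustrates how such a distilled confidence head might be trained in the first place: a slow teacher (AlphaFold in the paper, mocked here by an arbitrary deterministic function) labels sequences with confidence scores, and a small student regresses them. All names and the mock target are hypothetical; the authors' actual pipeline is in the AFDistill repository.
```python
# Toy distillation loop; the teacher below is a mock, not AlphaFold.
import torch
import torch.nn as nn

NUM_AA = 20  # standard amino-acid alphabet

class TinyStudent(nn.Module):
    """Small sequence-to-confidence regressor (the distilled model)."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(NUM_AA, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, one_hot):  # one_hot: (batch, length, NUM_AA)
        return torch.sigmoid(self.ff(one_hot).mean(dim=1)).squeeze(-1)

def mock_teacher_plddt(seqs: torch.Tensor) -> torch.Tensor:
    """Stand-in for AlphaFold's pLDDT on a batch of integer sequences.

    In practice these targets are precomputed offline, since avoiding the
    slow teacher at train time is the whole point of distillation.
    """
    return torch.sigmoid(4.0 * (seqs < 8).float().mean(dim=1) - 2.0)

torch.manual_seed(0)
student = TinyStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(200):
    seqs = torch.randint(0, NUM_AA, (8, 50))   # random training sequences
    target = mock_teacher_plddt(seqs)          # teacher confidence labels
    one_hot = nn.functional.one_hot(seqs, NUM_AA).float()
    loss = nn.functional.mse_loss(student(one_hot), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```
Once trained, the student replaces the teacher inside the inverse-folding training loop, as in the regularizer sketch shown earlier.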
This list is automatically generated from the titles and abstracts of the papers in this site.