AlphaFold Distillation for Protein Design
- URL: http://arxiv.org/abs/2210.03488v2
- Date: Wed, 22 Nov 2023 22:52:47 GMT
- Title: AlphaFold Distillation for Protein Design
- Authors: Igor Melnyk, Aurelie Lozano, Payel Das, Vijil Chenthamarakshan
- Abstract summary: Inverse protein folding is crucial in bio-engineering and drug discovery.
Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences.
We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inverse protein folding, the process of designing sequences that fold into a
specific 3D structure, is crucial in bio-engineering and drug discovery.
Traditional methods rely on experimentally resolved structures, but these cover
only a small fraction of protein sequences. Forward folding models like
AlphaFold offer a potential solution by accurately predicting structures from
sequences. However, these models are too slow for integration into the
optimization loop of inverse folding models during training. To address this,
we propose using knowledge distillation on folding model confidence metrics,
such as pTM or pLDDT scores, to create a faster and end-to-end differentiable
distilled model. This model can then be used as a structure consistency
regularizer in training the inverse folding model. Our technique is versatile
and can be applied to other design tasks, such as sequence-based protein
infilling. Experimental results show that our method outperforms
non-regularized baselines, yielding up to 3% improvement in sequence recovery
and up to 45% improvement in protein diversity while maintaining structural
consistency in generated sequences. Code is available at
https://github.com/IBM/AFDistill
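The regularized objective described in the abstract can be sketched as follows. This is an illustrative sketch, not the actual AFDistill implementation: the function names (`distilled_ptm`, `total_loss`), the weight `lam`, and the simple averaging stand-in for the distilled confidence network are all assumptions for illustration.

```python
# Hypothetical sketch of AFDistill-style structure-consistency regularization.
# A real distilled model would be a trained, differentiable network predicting
# pTM/pLDDT-like scores; here a simple average stands in for it.

def distilled_ptm(residue_probs):
    # Stand-in for the distilled model: maps per-residue probabilities
    # (values in [0, 1]) to a single pTM-like confidence score.
    flat = [p for row in residue_probs for p in row]
    return sum(flat) / len(flat)

def structure_consistency_loss(residue_probs):
    # Higher predicted confidence -> lower regularization loss.
    return 1.0 - distilled_ptm(residue_probs)

def total_loss(ce_loss, residue_probs, lam=0.1):
    # Combined objective: sequence-recovery cross-entropy plus the
    # distilled-confidence regularizer, weighted by lam.
    return ce_loss + lam * structure_consistency_loss(residue_probs)
```

The key point is that, unlike running AlphaFold in the loop, the distilled score is cheap to evaluate and differentiable end-to-end, so the regularizer can be backpropagated through during inverse-folding training.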
Related papers
- Improving AlphaFlow for Efficient Protein Ensembles Generation [64.10918970280603]
We propose a feature-conditioned generative model called AlphaFlow-Lit to realize efficient protein ensembles generation.
AlphaFlow-Lit performs on par with AlphaFlow and surpasses its distilled version without pretraining, while achieving a sampling acceleration of roughly 47x.
arXiv Detail & Related papers (2024-07-08T13:36:43Z)
- AlphaFold Meets Flow Matching for Generating Protein Ensembles [12.54715076335456]
We develop a flow-based generative modeling approach for learning and sampling the conformational landscapes of proteins.
Our method provides a superior combination of precision and diversity compared to AlphaFold with MSA subsampling.
Our method can diversify a static PDB structure with faster wall-clock convergence to certain equilibrium properties than replicate MD trajectories.
arXiv Detail & Related papers (2024-02-07T13:44:47Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant Autoencoders [1.8835495377767553]
Three-dimensional native states of natural proteins display recurring and hierarchical patterns.
Traditional graph-based modeling of protein structures is often limited to operate within a single fine-grained resolution.
We introduce Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all-atom protein structures.
arXiv Detail & Related papers (2023-10-04T01:01:11Z)
- Predicting protein variants with equivariant graph neural networks [0.0]
We compare the abilities of equivariant graph neural networks (EGNNs) and sequence-based approaches to identify promising amino-acid mutations.
Our proposed structural approach achieves performance competitive with sequence-based approaches while being trained on significantly fewer molecules.
arXiv Detail & Related papers (2023-06-21T12:44:52Z)
- Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs)
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with poor MSAs.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
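As one concrete illustration of the structural-adapter idea mentioned for LM-Design above, the following is a minimal, hypothetical sketch of a bottleneck adapter applied to a frozen language-model hidden state. The shapes, function names, and the way structural features are mixed in are all assumptions for illustration, not LM-Design's actual architecture.

```python
# Illustrative sketch (not LM-Design's actual code): a lightweight bottleneck
# adapter that mixes structural features into a frozen pLM hidden state.

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def adapter(h, struct_feat, W_down, W_up):
    # Bottleneck adapter: project the hidden state down, add structural
    # features, apply ReLU, project back up, and add a residual connection.
    z = matvec(W_down, h)
    z = [zi + si for zi, si in zip(z, struct_feat)]
    z = [max(0.0, zi) for zi in z]
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]
```

Because only the small adapter matrices are trained, the pretrained pLM weights stay frozen, which is what makes this kind of "structural surgery" lightweight.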
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.