Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate
Folding Landscape and Protein Structure Prediction
- URL: http://arxiv.org/abs/2208.09652v2
- Date: Sun, 8 Oct 2023 11:29:51 GMT
- Title: Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate
Folding Landscape and Protein Structure Prediction
- Authors: Jun Zhang, Sirui Liu, Mengyun Chen, Haotian Chu, Min Wang, Zidong
Wang, Jialiang Yu, Ningxi Ni, Fan Yu, Diqing Chen, Yi Isaac Yang, Boxin Xue,
Lijiang Yang, Yuan Liu and Yi Qin Gao
- Abstract summary: We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime.
- Score: 28.630603355510324
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data-driven predictive methods which can efficiently and accurately transform
protein sequences into biologically active structures are highly valuable for
scientific research and medical development. Determining an accurate folding
landscape using co-evolutionary information is fundamental to the success of
modern protein structure prediction methods. As the state of the art,
AlphaFold2 has dramatically raised the accuracy without performing explicit
co-evolutionary analysis. Nevertheless, its performance still shows strong
dependence on available sequence homologs. Based on an investigation of the
cause of this dependence, we present EvoGen, a meta generative model, to
remedy the underperformance of AlphaFold2 for poor MSA targets. By prompting
the model with calibrated or virtually generated homologue sequences, EvoGen
helps AlphaFold2 fold accurately in the low-data regime and even achieve
encouraging performance with single-sequence predictions. Being able to make
accurate predictions with few-shot MSAs not only generalizes AlphaFold2 better
for orphan sequences, but also democratizes its use for high-throughput
applications. Besides, EvoGen combined with AlphaFold2 yields a probabilistic
structure generation method which could explore alternative conformations of
protein sequences, and the task-aware differentiable algorithm for sequence
generation will benefit other related tasks including protein design.
Related papers
- MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering
the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence
Alignment Generation [30.2874172276931]
We introduce MSA-Augmenter, which generates useful, novel protein sequences not currently found in databases.
Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain co-evolutionary information from inferior MSAs.
arXiv Detail & Related papers (2023-06-02T14:13:50Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Unsupervised language models for disease variant prediction [3.6942566104432886]
We find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot.
We show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
arXiv Detail & Related papers (2022-12-07T22:28:13Z)
- AlphaFold Distillation for Protein Design [25.190210443632825]
Inverse protein folding is crucial in bio-engineering and drug discovery.
Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences.
We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
arXiv Detail & Related papers (2022-10-05T19:43:06Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model with thousands of millions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based
Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)