HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative
- URL: http://arxiv.org/abs/2207.13921v1
- Date: Thu, 28 Jul 2022 07:30:33 GMT
- Title: HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative
- Authors: Xiaomin Fang, Fan Wang, Lihang Liu, Jingzhou He, Dayong Lin, Yingfei
Xiang, Xiaonan Zhang, Hua Wu, Hui Li, Le Song
- Abstract summary: HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
- Score: 61.984700682903096
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: AI-based protein structure prediction pipelines, such as AlphaFold2, have
achieved near-experimental accuracy. These advanced pipelines mainly rely on
Multiple Sequence Alignments (MSAs) and templates as inputs to learn the
co-evolution information from the homologous sequences. Nonetheless, searching
MSAs and templates from protein databases is time-consuming, usually taking
dozens of minutes. Consequently, we attempt to explore the limits of fast
protein structure prediction by using only primary sequences of proteins.
We propose HelixFold-Single, which combines a large-scale protein language
model with the superior geometric learning capability of AlphaFold2.
HelixFold-Single first pre-trains a large-scale protein language model (PLM)
on billions of primary sequences using the self-supervised learning paradigm;
this PLM serves as an alternative to MSAs and templates for learning the
co-evolution information. Then, by combining the
pre-trained PLM and the essential components of AlphaFold2, we obtain an
end-to-end differentiable model to predict the 3D coordinates of atoms from
only the primary sequence. HelixFold-Single is validated on the CASP14 and
CAMEO datasets, achieving accuracy competitive with MSA-based methods on
targets with large homologous families. Furthermore, HelixFold-Single consumes much
less time than the mainstream pipelines for protein structure prediction,
demonstrating its potential in tasks requiring many predictions. The code of
HelixFold-Single is available at
https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single,
and we also provide stable web services at
https://paddlehelix.baidu.com/app/drug/protein-single/forecast.
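To make the pipeline described in the abstract concrete, below is a minimal, hypothetical Python (PyTorch) sketch of an MSA-free folding model: a pre-trained protein language model encodes the primary sequence, its per-residue features are combined into a pair representation that stands in for MSA-derived co-evolution features, and a small head predicts per-residue 3D coordinates. The class name, dimensions, and layout are illustrative assumptions only; this is not the actual HelixFold-Single architecture or API (the official implementation is built on PaddlePaddle).

```python
# Hypothetical sketch of an MSA-free folder; not the HelixFold-Single code.
import torch
import torch.nn as nn


class MSAFreeFolder(nn.Module):
    def __init__(self, vocab_size=21, d_single=384, d_pair=128, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_single)
        # Stand-in for the large pre-trained protein language model (PLM).
        layer = nn.TransformerEncoderLayer(d_single, nhead=8, batch_first=True)
        self.plm = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Pair representation built from single-sequence features; this is the
        # stand-in for the co-evolution signal usually derived from MSAs.
        self.pair_proj = nn.Linear(2 * d_single, d_pair)
        # Stand-in for AlphaFold2's structure module: a linear head mapping
        # per-residue features to C-alpha-like 3D coordinates.
        self.coord_head = nn.Linear(d_single + d_pair, 3)

    def forward(self, tokens):                    # tokens: (B, L) residue ids
        single = self.plm(self.embed(tokens))     # (B, L, d_single)
        L = tokens.size(1)
        left = single.unsqueeze(2).expand(-1, -1, L, -1)
        right = single.unsqueeze(1).expand(-1, L, -1, -1)
        pair = self.pair_proj(torch.cat([left, right], dim=-1))  # (B, L, L, d_pair)
        pair_pooled = pair.mean(dim=2)            # (B, L, d_pair) per-residue pair summary
        coords = self.coord_head(torch.cat([single, pair_pooled], dim=-1))  # (B, L, 3)
        return coords, pair


# The whole path from residue ids to coordinates is differentiable, so the PLM
# and the structure head can be trained (or fine-tuned) end to end.
tokens = torch.randint(0, 21, (1, 128))
coords, pair = MSAFreeFolder()(tokens)
```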
Related papers
- MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Pairing interacting protein sequences using masked language modeling [0.3222802562733787]
We develop a method to pair interacting protein sequences using protein language models trained on sequence alignments.
We exploit the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context.
We show that it captures inter-chain coevolution even though it was trained only on single-chain data, which means that it can be used out-of-distribution.
arXiv Detail & Related papers (2023-08-14T13:42:09Z)
- Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning [2.01243755755303]
We introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets.
This is the first effort to unify different representations of proteins into a single model, which we name the Protein Multimodal Network (PMN).
arXiv Detail & Related papers (2023-08-02T12:08:17Z)
- DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models [47.73386438748902]
DiffDock-PP is a diffusion generative model that learns to translate and rotate unbound protein structures into their bound conformations.
We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines.
arXiv Detail & Related papers (2023-04-08T02:10:44Z)
- AlphaFold Distillation for Protein Design [25.190210443632825]
Inverse protein folding is crucial in bio-engineering and drug discovery.
Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences.
We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
arXiv Detail & Related papers (2022-10-05T19:43:06Z)
- Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)