HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative
- URL: http://arxiv.org/abs/2207.13921v1
- Date: Thu, 28 Jul 2022 07:30:33 GMT
- Title: HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative
- Authors: Xiaomin Fang, Fan Wang, Lihang Liu, Jingzhou He, Dayong Lin, Yingfei
Xiang, Xiaonan Zhang, Hua Wu, Hui Li, Le Song
- Abstract summary: HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
- Score: 61.984700682903096
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: AI-based protein structure prediction pipelines, such as AlphaFold2, have
achieved near-experimental accuracy. These advanced pipelines mainly rely on
Multiple Sequence Alignments (MSAs) and templates as inputs to learn the
co-evolution information from the homologous sequences. Nonetheless, searching
MSAs and templates from protein databases is time-consuming, usually taking
dozens of minutes. Consequently, we attempt to explore the limits of fast
protein structure prediction by using only primary sequences of proteins.
We propose HelixFold-Single, which combines a large-scale protein language
model with the superior geometric learning capability of AlphaFold2.
HelixFold-Single first pre-trains a large-scale protein language model (PLM)
on billions of primary sequences using the self-supervised learning paradigm;
this PLM serves as an alternative to MSAs and templates for learning the
co-evolution information. Then, by combining the
pre-trained PLM and the essential components of AlphaFold2, we obtain an
end-to-end differentiable model to predict the 3D coordinates of atoms from
only the primary sequence. HelixFold-Single is validated on the CASP14 and
CAMEO datasets, achieving accuracy competitive with MSA-based methods on
targets with large homologous families. Furthermore, HelixFold-Single consumes much
less time than the mainstream pipelines for protein structure prediction,
demonstrating its potential in tasks requiring many predictions. The code of
HelixFold-Single is available at
https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single,
and we also provide stable web services at
https://paddlehelix.baidu.com/app/drug/protein-single/forecast.
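To make the pipeline described in the abstract concrete, below is a minimal, hypothetical Python (PyTorch) sketch of an MSA-free folding model: a pre-trained protein language model encodes the primary sequence, its per-residue features are combined into a pair representation that stands in for MSA-derived co-evolution features, and a small head predicts per-residue 3D coordinates. The class name, dimensions, and layout are illustrative assumptions only; this is not the actual HelixFold-Single architecture or API (the official implementation is built on PaddlePaddle).

```python
# Hypothetical sketch of an MSA-free folder; not the HelixFold-Single code.
import torch
import torch.nn as nn


class MSAFreeFolder(nn.Module):
    def __init__(self, vocab_size=21, d_single=384, d_pair=128, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_single)
        # Stand-in for the large pre-trained protein language model (PLM).
        layer = nn.TransformerEncoderLayer(d_single, nhead=8, batch_first=True)
        self.plm = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Pair representation built from single-sequence features; this is the
        # stand-in for the co-evolution signal usually derived from MSAs.
        self.pair_proj = nn.Linear(2 * d_single, d_pair)
        # Stand-in for AlphaFold2's structure module: a linear head mapping
        # per-residue features to C-alpha-like 3D coordinates.
        self.coord_head = nn.Linear(d_single + d_pair, 3)

    def forward(self, tokens):                    # tokens: (B, L) residue ids
        single = self.plm(self.embed(tokens))     # (B, L, d_single)
        L = tokens.size(1)
        left = single.unsqueeze(2).expand(-1, -1, L, -1)
        right = single.unsqueeze(1).expand(-1, L, -1, -1)
        pair = self.pair_proj(torch.cat([left, right], dim=-1))  # (B, L, L, d_pair)
        pair_pooled = pair.mean(dim=2)            # (B, L, d_pair) per-residue pair summary
        coords = self.coord_head(torch.cat([single, pair_pooled], dim=-1))  # (B, L, 3)
        return coords, pair


# The whole path from residue ids to coordinates is differentiable, so the PLM
# and the structure head can be trained (or fine-tuned) end to end.
tokens = torch.randint(0, 21, (1, 128))
coords, pair = MSAFreeFolder()(tokens)
```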
Related papers
- MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Pairing interacting protein sequences using masked language modeling [0.3222802562733787]
We develop a method to pair interacting protein sequences using protein language models trained on sequence alignments.
We exploit the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context.
We show that it captures inter-chain coevolution even though it was trained only on single-chain data, which means that it can be used out-of-distribution.
arXiv Detail & Related papers (2023-08-14T13:42:09Z)
- Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning [2.01243755755303]
We introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets.
This is the first effort to unify different representations of proteins into a single model, which we name the Protein Multimodal Network (PMN).
arXiv Detail & Related papers (2023-08-02T12:08:17Z)
- DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models [47.73386438748902]
DiffDock-PP is a diffusion generative model that learns to translate and rotate unbound protein structures into their bound conformations.
We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines.
arXiv Detail & Related papers (2023-04-08T02:10:44Z)
- AlphaFold Distillation for Protein Design [25.190210443632825]
Inverse protein folding is crucial in bio-engineering and drug discovery.
Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences.
We propose using knowledge distillation on folding model confidence metrics to create a faster and end-to-end differentiable distilled model.
arXiv Detail & Related papers (2022-10-05T19:43:06Z)
- Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)