MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
- URL: http://arxiv.org/abs/2406.05347v3
- Date: Mon, 28 Oct 2024 08:51:54 GMT
- Title: MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
- Authors: Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song
- Abstract summary: Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low-MSA regime.
- Score: 48.398329286769304
- Abstract: Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure predictions is often compromised for protein sequences that lack sufficient homologous information to construct high-quality MSAs. Although various methods have been proposed to generate virtual MSA under these conditions, they fall short in comprehensively capturing the intricate coevolutionary patterns within MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach to prompt protein structure predictions via MSA generative pretraining in the low-MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns. Built on this encoding, its flexible 1D MSA decoding framework facilitates zero- and few-shot learning. Moreover, we demonstrate that leveraging the feedback from AlphaFold2 can further enhance the model capacity via Rejective Fine-tuning (RFT) and Reinforcement Learning from AF2 Feedback (RLAF). Extensive experiments confirm the efficacy of MSAGPT in generating faithful virtual MSA to enhance structure prediction accuracy. The transfer learning capabilities also highlight its great potential for facilitating other protein tasks.
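The abstract above describes a 2D evolutionary positional encoding in which each MSA token is indexed both by the sequence (row) it belongs to and by its residue column, so a plain 1D decoder can still see the 2D layout of the alignment. The paper's implementation details are not given here; the snippet below is only a minimal, hypothetical PyTorch sketch of that idea, with all module and parameter names chosen for illustration.

```python
import torch
import torch.nn as nn

class Evolutionary2DPositionalEncoding(nn.Module):
    """Hypothetical sketch of a 2D positional encoding for MSA tokens.

    Each token in a flattened MSA gets two indices: which sequence (row)
    of the alignment it belongs to and which residue column it occupies.
    Summing the two learned embeddings lets a 1D decoder recover the
    row/column structure of the alignment.
    """

    def __init__(self, d_model: int, max_rows: int = 128, max_cols: int = 1024):
        super().__init__()
        self.row_embed = nn.Embedding(max_rows, d_model)  # sequence index within the MSA
        self.col_embed = nn.Embedding(max_cols, d_model)  # residue position within the sequence

    def forward(self, token_emb: torch.Tensor, row_ids: torch.Tensor,
                col_ids: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); row_ids/col_ids: (batch, seq_len)
        return token_emb + self.row_embed(row_ids) + self.col_embed(col_ids)

# Usage: flatten an MSA of R sequences x C columns into one token stream,
# then attach (row, column) indices to every token before 1D decoding.
R, C, d_model = 4, 16, 64
tokens = torch.randn(1, R * C, d_model)                   # embedded MSA tokens
rows = torch.arange(R).repeat_interleave(C).unsqueeze(0)  # 0,0,...,1,1,...
cols = torch.arange(C).repeat(R).unsqueeze(0)             # 0,1,...,C-1,0,1,...
pe = Evolutionary2DPositionalEncoding(d_model)
out = pe(tokens, rows, cols)                              # (1, R*C, d_model)
```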
Related papers
- Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding [43.811432723460534]
We introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap.
Our approach integrates a novel structure-aware module into pLMs to provide them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins.
We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
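The DPLM entry above describes pre-training within a generative self-supervised discrete diffusion probabilistic framework. The published model has its own noise schedule and parameterization; the code below is only an illustrative absorbing-state (masking) discrete diffusion training step, with the mask token id, schedule, and stand-in network chosen for the sketch.

```python
import torch
import torch.nn.functional as F

MASK_ID = 20  # hypothetical id of the absorbing [MASK] token (amino acids use ids 0..19)

def discrete_diffusion_loss(model, seqs: torch.Tensor) -> torch.Tensor:
    """One illustrative absorbing-state discrete diffusion training step.

    seqs: (batch, length) integer amino-acid tokens. A noise level is sampled
    per example; that fraction of positions is replaced by the mask token, and
    the network is trained to recover the original residues at those positions.
    """
    batch, length = seqs.shape
    t = torch.rand(batch, 1)                 # per-example corruption level in (0, 1)
    corrupt = torch.rand(batch, length) < t  # positions absorbed into [MASK]
    noisy = torch.where(corrupt, torch.full_like(seqs, MASK_ID), seqs)

    logits = model(noisy)                    # (batch, length, vocab) denoising prediction
    return F.cross_entropy(logits[corrupt], seqs[corrupt])  # supervise masked positions only

# Minimal stand-in model just to shape-check the loss.
model = torch.nn.Sequential(torch.nn.Embedding(21, 64), torch.nn.Linear(64, 21))
seqs = torch.randint(0, 20, (8, 50))
print(discrete_diffusion_loss(model, seqs))
```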
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Progressive Multi-Modality Learning for Inverse Protein Folding [47.095862120116976]
We propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning.
MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based language model to incorporate prior protein semantic knowledge.
Experimental results, obtained by training only on a small dataset, demonstrate that MMDesign consistently outperforms baselines on various public benchmarks.
arXiv Detail & Related papers (2023-12-11T10:59:23Z)
- Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation [30.2874172276931]
We introduce MSA-Augmenter, which generates useful, novel protein sequences not currently found in databases.
Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain co-evolutionary information from inferior MSAs.
arXiv Detail & Related papers (2023-06-02T14:13:50Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer better to new protein domains and outperforms MSA Transformer on de novo protein prediction.
Our study fills a frequently encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
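The Retrieved Sequence Augmentation entry above replaces MSA construction with a retrieval step: similar sequences are fetched and simply appended to the query, with no alignment or pre-processing. The sketch below is a generic, hypothetical rendering of that pipeline; the embedder, database, and separator token are placeholders rather than the paper's actual components.

```python
import numpy as np

def embed(seq: str) -> np.ndarray:
    """Placeholder sequence embedder (a pLM encoder in practice)."""
    vec = np.zeros(26)
    for aa in seq:
        vec[ord(aa) - ord('A')] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve_and_augment(query: str, database: list[str], k: int = 3) -> str:
    """Retrieve the k most similar sequences and concatenate them to the query,
    without alignment or any other pre-processing."""
    q = embed(query)
    scores = [float(np.dot(q, embed(s))) for s in database]
    top = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)[:k]
    retrieved = [database[i] for i in top]
    return "<sep>".join([query] + retrieved)  # augmented input for the encoder

db = ["MKTAYIAKQR", "MKVAYLAKQR", "GGSSGGSSGG", "MKTAYIAKHR"]
print(retrieve_and_augment("MKTAYIAKQR", db, k=2))
```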
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into the pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
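LM-Design's "structural surgery" above implants a lightweight adapter that injects structure information into an otherwise frozen pLM. This summary does not specify the adapter's exact design; the code below is a generic bottleneck adapter with structure-conditioned cross-attention, a plausible sketch rather than the paper's architecture, with all shapes and names chosen for illustration.

```python
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Illustrative lightweight adapter that mixes structure features into a
    frozen pLM's hidden states via cross-attention plus a bottleneck MLP."""

    def __init__(self, d_model: int, d_struct: int, bottleneck: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                kdim=d_struct, vdim=d_struct,
                                                batch_first=True)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, length, d_model) from a frozen pLM layer
        # struct_feats: (batch, length, d_struct) encoded backbone geometry
        attn_out, _ = self.cross_attn(hidden, struct_feats, struct_feats)
        h = hidden + attn_out                                  # structure-aware residual update
        return h + self.up(torch.relu(self.down(self.norm(h))))

# Only the adapter would be trained; the pLM weights stay frozen.
adapter = StructuralAdapter(d_model=320, d_struct=128)
h = torch.randn(2, 50, 320)
s = torch.randn(2, 50, 128)
out = adapter(h, s)  # (2, 50, 320)
```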
- Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model with thousands of millions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
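HelixFold-Single, summarized above, swaps the MSA-search stage for a large pre-trained protein language model whose per-residue representations feed a geometric module that outputs 3D coordinates end to end. The real model is far more involved; the snippet below only illustrates the MSA-free dataflow, and every module, dimension, and name here is a hypothetical stand-in.

```python
import torch
import torch.nn as nn

class MSAFreeFolder(nn.Module):
    """Toy sketch of the MSA-free pipeline: primary sequence -> pLM
    embeddings -> geometry head -> per-residue 3D coordinates."""

    def __init__(self, vocab: int = 21, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.plm = nn.TransformerEncoder(layer, num_layers=4)  # stands in for a large pre-trained pLM
        self.geometry_head = nn.Linear(d_model, 3)             # stands in for the structure module

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, length) amino-acid ids; no MSA or templates anywhere
        h = self.plm(self.embed(tokens))
        return self.geometry_head(h)        # (batch, length, 3) C-alpha coordinates

model = MSAFreeFolder()
seq = torch.randint(0, 21, (1, 120))        # a single primary sequence
coords = model(seq)                         # differentiable end to end
```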
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
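The EBM-Fold entry above refines structures by following gradients supplied by a data-driven generative network rather than a hand-crafted force field. The sketch below shows the generic pattern of such fully differentiable refinement, descending a learned energy on candidate coordinates; the energy network here is an untrained stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical learned energy: maps (length, 3) C-alpha coordinates to scalars.
energy_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

def refine(coords: torch.Tensor, steps: int = 100, lr: float = 1e-2) -> torch.Tensor:
    """Fully differentiable structure refinement: descend the learned energy
    landscape starting from an initial (e.g., random or perturbed) decoy."""
    coords = coords.clone().requires_grad_(True)
    opt = torch.optim.Adam([coords], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = energy_net(coords).sum()  # data-driven score of the current decoy
        energy.backward()                  # gradients flow to the coordinates themselves
        opt.step()
    return coords.detach()

initial_decoy = torch.randn(120, 3)        # random starting structure
refined = refine(initial_decoy)
```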