Exploring evolution-based & -free protein language models as protein
function predictors
- URL: http://arxiv.org/abs/2206.06583v1
- Date: Tue, 14 Jun 2022 03:56:10 GMT
- Title: Exploring evolution-based & -free protein language models as protein
function predictors
- Authors: Mingyang Hu, Fajie Yuan, Kevin K. Yang, Fusong Ju, Jin Su, Hui Wang,
Fei Yang, Qiuyang Ding
- Abstract summary: Large-scale Protein Language Models (PLMs) have improved performance in protein prediction tasks.
We investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment) and Evoformer (structural)
Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function?
We compare these models by empirical study along with new insights and conclusions.
- Score: 12.381080613343306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Protein Language Models (PLMs) have improved performance in
protein prediction tasks, ranging from 3D structure prediction to various
function predictions. In particular, AlphaFold, a ground-breaking AI system,
could potentially reshape structural biology. However, the utility of the PLM
module in AlphaFold, Evoformer, has not been explored beyond structure
prediction. In this paper, we investigate the representation ability of three
popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence
alignment) and Evoformer (structural), with a special focus on Evoformer.
Specifically, we aim to answer the following key questions: (i) Does the
Evoformer trained as part of AlphaFold produce representations amenable to
predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and
MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein
data? In this regard, are they complementary to each other? We compare these
models by empirical study along with new insights and conclusions. Finally, we
release code and datasets for reproducibility.
Related papers
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z) - MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z) - Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z) - A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z) - Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z) - Beating the Best: Improving on AlphaFold2 at Protein Structure
Prediction [1.3124513975412255]
ARStack significantly outperforms AlphaFold2 and RoseTTAFold.
We rigorously demonstrate this using two sets of non-homologous proteins, and a test set of protein structures published after that of AlphaFold2 and RoseTTAFold.
arXiv Detail & Related papers (2023-01-18T14:39:34Z) - Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate
Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets.
By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z) - HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model with thousands of millions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z) - ProtTrans: Towards Cracking the Language of Life's Code Through
Self-Supervised Deep Learning and High Performance Computing [2.747785739760799]
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP.
Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids.
For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information.
arXiv Detail & Related papers (2020-07-13T07:54:20Z) - BERTology Meets Biology: Interpreting Attention in Protein Language
Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.