PoET: A generative model of protein families as sequences-of-sequences
- URL: http://arxiv.org/abs/2306.06156v3
- Date: Wed, 1 Nov 2023 12:34:47 GMT
- Title: PoET: A generative model of protein families as sequences-of-sequences
- Authors: Timothy F. Truong Jr, Tristan Bepler
- Abstract summary: We propose a generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences.
PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest.
We show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths.
- Score: 5.05828899601167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative protein language models are a natural way to design new proteins
with desired functions. However, current models are either difficult to direct
to produce a protein from a specific family of interest, or must be trained on
a large multiple sequence alignment (MSA) from the specific family of interest,
making them unable to benefit from transfer learning across families. To
address this, we propose Protein Evolutionary Transformer (PoET), an
autoregressive generative model of whole
protein families that learns to generate sets of related proteins as
sequences-of-sequences across tens of millions of natural protein sequence
clusters. PoET can be used as a retrieval-augmented language model to generate
and score arbitrary modifications conditioned on any protein family of
interest, and can extrapolate from short context lengths to generalize well
even for small families. This is enabled by a unique Transformer layer; we
model tokens sequentially within sequences while attending between sequences
order invariantly, allowing PoET to scale to context lengths beyond those used
during training. In extensive experiments on deep mutational scanning datasets,
we show that PoET outperforms existing protein language models and evolutionary
sequence models for variant function prediction across proteins of all MSA
depths. We also demonstrate PoET's ability to controllably generate new protein
sequences.
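
The layer described in the abstract can be pictured as masked self-attention over the family flattened into one long token stream: a token sees earlier tokens of its own sequence plus every token of the sequences that precede it, and positional information restarts at each sequence, so nothing in the attention pattern encodes the order of the context sequences. The sketch below illustrates that pattern; it is a simplified reading of the abstract, not the authors' implementation (the real PoET layer factorizes within- and between-sequence attention into separate tiers), and all class and function names are illustrative.

```python
# A minimal, hedged sketch (PyTorch) of the sequence-of-sequences attention
# pattern described in the abstract -- NOT the authors' implementation.
# A family is flattened into one token stream; each token attends causally
# to earlier tokens of its own sequence and to all tokens of the sequences
# that precede it. Positions restart at every sequence, so the attention
# pattern carries no information about the order of the context sequences.

import torch
import torch.nn as nn


def sequence_of_sequences_mask(n_seqs: int, seq_len: int) -> torch.Tensor:
    """Boolean mask (True = blocked) over the flattened (n_seqs * seq_len) stream."""
    total = n_seqs * seq_len
    seq_id = torch.arange(total) // seq_len      # which sequence each token belongs to
    pos = torch.arange(total) % seq_len          # position within its own sequence
    q_seq, k_seq = seq_id[:, None], seq_id[None, :]
    q_pos, k_pos = pos[:, None], pos[None, :]
    allowed = (k_seq < q_seq) | ((k_seq == q_seq) & (k_pos <= q_pos))
    return ~allowed


class SequenceOfSequencesLayer(nn.Module):
    """One attention block over a flattened family of related sequences."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, seq_len: int = 100):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Positional embeddings restart at each sequence, so the model can order
        # tokens within a sequence but not the sequences themselves.
        self.pos = nn.Embedding(seq_len, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, family: torch.Tensor) -> torch.Tensor:
        # family: (n_seqs, seq_len, d_model) -- embeddings of one protein family
        n, L, d = family.shape
        x = family + self.pos(torch.arange(L))   # within-sequence positions only
        x = x.reshape(1, n * L, d)               # flatten to one long stream
        mask = sequence_of_sequences_mask(n, L)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out).reshape(n, L, d)


if __name__ == "__main__":
    layer = SequenceOfSequencesLayer(seq_len=100)
    family = torch.randn(8, 100, 64)             # 8 related sequences, length 100
    print(layer(family).shape)                   # torch.Size([8, 100, 64])
```

Because the cross-sequence attention carries no sequence-order signal, adding more context sequences only adds an unordered set of tokens to attend over, which is consistent with the abstract's claim that PoET can extrapolate to context lengths beyond those used during training.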
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that for the first time employs an LLM tailored to PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model transfers better to new protein domains and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
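
As a rough picture of the retrieval-augmentation idea (a hedged sketch, not the paper's actual retriever or model): embed the query protein, pull the most similar database sequences by embedding similarity, and concatenate them with the query as unaligned context. The `embed_fn` interface, the separator token, and the toy composition embedding below are illustrative assumptions.

```python
# A hedged sketch of the general retrieval-augmentation idea: fetch similar
# sequences by embedding similarity and feed them to the model alongside the
# query, with no alignment step. `embed_fn` and the toy database stand in for
# a real retriever and sequence corpus.

import numpy as np


def retrieve_context(query: str, database: list[str], embed_fn, k: int = 8) -> list[str]:
    """Return the k database sequences most similar to the query by cosine similarity."""
    q = embed_fn(query)
    db = np.stack([embed_fn(s) for s in database])
    sims = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [database[i] for i in top]


def build_model_input(query: str, context: list[str], sep: str = "<sep>") -> str:
    # Unaligned retrieved sequences are simply concatenated ahead of the query.
    return sep.join(context + [query])


if __name__ == "__main__":
    # Toy embedding: amino-acid composition vector (a real system would use a
    # learned dense retriever or a homology search tool).
    AA = "ACDEFGHIKLMNPQRSTVWY"
    embed = lambda s: np.array([s.count(a) for a in AA], dtype=float)

    db = ["MKTAYIAKQR", "MKLVVLGAGG", "MKTAYIAKQQ", "GSHMSSGLVP"]
    query = "MKTAYIAKQN"
    print(build_model_input(query, retrieve_context(query, db, embed, k=2)))
```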
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Unsupervised language models for disease variant prediction [3.6942566104432886]
We find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot.
We show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
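
One widely used zero-shot recipe of this kind scores a substitution by masking the mutated position and comparing the model's log-probability of the mutant versus the wildtype amino acid at that site. The sketch below illustrates that recipe under a hypothetical `masked_logits` interface; it is not tied to the specific model evaluated in the paper.

```python
# A hedged sketch of a common zero-shot variant-scoring recipe with a masked
# protein LM: mask the mutated position and compare the model's log-probability
# of the mutant vs. the wildtype amino acid there. `masked_logits` is a
# hypothetical interface (sequence with one masked position -> per-amino-acid
# logits at that position), not a specific model's API.

import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def variant_score(wt_seq: str, pos: int, mut_aa: str, masked_logits) -> float:
    """log p(mutant) - log p(wildtype) at `pos` (0-based); higher = more tolerated."""
    wt_aa = wt_seq[pos]
    masked = wt_seq[:pos] + "<mask>" + wt_seq[pos + 1:]
    logits = masked_logits(masked, pos)                 # {amino_acid: logit}
    logz = math.log(sum(math.exp(l) for l in logits.values()))
    log_p = {aa: l - logz for aa, l in logits.items()}
    return log_p[mut_aa] - log_p[wt_aa]


if __name__ == "__main__":
    # Toy stand-in model: uniform logits, so every variant scores 0.0.
    toy = lambda seq, pos: {aa: 0.0 for aa in AMINO_ACIDS}
    print(variant_score("MKTAYIAKQR", 3, "W", toy))
```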
arXiv Detail & Related papers (2022-12-07T22:28:13Z)
- Generative power of a protein language model trained on multiple sequence alignments [0.5639904484784126]
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
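
The iterative procedure can be pictured as Gibbs-style resampling: repeatedly mask a few positions and redraw them from the model's predicted distributions. The sketch below shows that loop for a single sequence with a hypothetical `masked_lm` interface; the paper's method operates on whole MSAs with MSA Transformer, so this is only a simplified illustration.

```python
# A hedged sketch of iterative masked-LM sampling: repeatedly mask a few
# positions and resample them from the model's predictions. `masked_lm` is a
# hypothetical callable (tokens with masks -> per-position probability dicts)
# standing in for a real masked protein language model.

import random

MASK = "<mask>"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def iterative_masked_sampling(masked_lm, seq, n_rounds=100, per_round=3, rng=random):
    seq = list(seq)
    for _ in range(n_rounds):
        positions = rng.sample(range(len(seq)), per_round)
        masked = list(seq)
        for p in positions:
            masked[p] = MASK                   # hide the chosen positions
        probs = masked_lm(masked)              # list of {amino_acid: prob}
        for p in positions:                    # resample each masked site
            aas, ps = zip(*probs[p].items())
            seq[p] = rng.choices(aas, weights=ps, k=1)[0]
    return "".join(seq)


if __name__ == "__main__":
    # Toy stand-in model: uniform distribution at every position.
    toy = lambda toks: [{aa: 1.0 for aa in AMINO_ACIDS} for _ in toks]
    print(iterative_masked_sampling(toy, "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```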
arXiv Detail & Related papers (2022-04-14T16:59:05Z)
- Few Shot Protein Generation [4.7210697296108926]
We present the MSA-to-protein transformer, a generative model of protein sequences conditioned on protein families represented by multiple sequence alignments (MSAs).
Unlike existing approaches to learning generative models of protein families, the MSA-to-protein transformer conditions sequence generation directly on a learned encoding of the multiple sequence alignment.
Our generative approach accurately models epistasis and indels and, unlike other approaches, allows for exact inference and efficient sampling.
arXiv Detail & Related papers (2022-04-03T22:14:02Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
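
A conceptual reading of the pairwise objective (a hedged sketch, not the paper's exact architecture or loss): mask two positions at once and predict the joint identity of the residue pair as a single V x V-way classification, rather than predicting each masked residue independently, which pushes the representation toward inter-residue co-variation.

```python
# A hedged, conceptual sketch of the pairwise-masking idea: mask two residues
# at once and predict the *joint* identity of the pair instead of two
# independent residues. This illustrates the objective only; the model, head,
# and loss here are illustrative assumptions.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

V = 21           # 20 amino acids + a mask token
MASK_ID = 20
D = 64


class ToyPairwiseMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        # Pair head: scores all joint assignments for a pair of masked positions.
        self.pair_head = nn.Linear(2 * D, (V - 1) * (V - 1))

    def pair_logits(self, tokens, i, j):
        h = self.encoder(self.embed(tokens))             # (B, L, D)
        pair = torch.cat([h[:, i], h[:, j]], dim=-1)     # (B, 2D)
        return self.pair_head(pair)                      # (B, (V-1)*(V-1))


def pairwise_masked_loss(model, tokens):
    """Mask a random pair of positions and compute the joint cross-entropy."""
    B, L = tokens.shape
    i, j = random.sample(range(L), 2)
    target = tokens[:, i] * (V - 1) + tokens[:, j]       # joint pair index
    corrupted = tokens.clone()
    corrupted[:, i] = MASK_ID
    corrupted[:, j] = MASK_ID
    return F.cross_entropy(model.pair_logits(corrupted, i, j), target)


if __name__ == "__main__":
    model = ToyPairwiseMLM()
    batch = torch.randint(0, V - 1, (4, 50))             # 4 toy sequences of length 50
    print(pairwise_masked_loss(model, batch).item())
```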
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that the Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.