PoET: A generative model of protein families as sequences-of-sequences
- URL: http://arxiv.org/abs/2306.06156v3
- Date: Wed, 1 Nov 2023 12:34:47 GMT
- Title: PoET: A generative model of protein families as sequences-of-sequences
- Authors: Timothy F. Truong Jr, Tristan Bepler
- Abstract summary: We propose a generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences.
PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest.
We show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths.
- Score: 5.05828899601167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative protein language models are a natural way to design new proteins
with desired functions. However, current models are either difficult to direct
to produce a protein from a specific family of interest, or must be trained on
a large multiple sequence alignment (MSA) from the specific family of interest,
making them unable to benefit from transfer learning across families. To
address this, we propose Protein Evolutionary Transformer (PoET), an
autoregressive generative model of whole
protein families that learns to generate sets of related proteins as
sequences-of-sequences across tens of millions of natural protein sequence
clusters. PoET can be used as a retrieval-augmented language model to generate
and score arbitrary modifications conditioned on any protein family of
interest, and can extrapolate from short context lengths to generalize well
even for small families. This is enabled by a unique Transformer layer; we
model tokens sequentially within sequences while attending between sequences
order invariantly, allowing PoET to scale to context lengths beyond those used
during training. In extensive experiments on deep mutational scanning datasets,
we show that PoET outperforms existing protein language models and evolutionary
sequence models for variant function prediction across proteins of all MSA
depths. We also demonstrate PoET's ability to controllably generate new protein
sequences.
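
The layer described in the abstract can be pictured as masked self-attention over the family flattened into one long token stream: a token sees earlier tokens of its own sequence plus every token of the sequences that precede it, and positional information restarts at each sequence, so nothing in the attention pattern encodes the order of the context sequences. The sketch below illustrates that pattern; it is a simplified reading of the abstract, not the authors' implementation (the real PoET layer factorizes within- and between-sequence attention into separate tiers), and all class and function names are illustrative.

```python
# A minimal, hedged sketch (PyTorch) of the sequence-of-sequences attention
# pattern described in the abstract -- NOT the authors' implementation.
# A family is flattened into one token stream; each token attends causally
# to earlier tokens of its own sequence and to all tokens of the sequences
# that precede it. Positions restart at every sequence, so the attention
# pattern carries no information about the order of the context sequences.

import torch
import torch.nn as nn


def sequence_of_sequences_mask(n_seqs: int, seq_len: int) -> torch.Tensor:
    """Boolean mask (True = blocked) over the flattened (n_seqs * seq_len) stream."""
    total = n_seqs * seq_len
    seq_id = torch.arange(total) // seq_len      # which sequence each token belongs to
    pos = torch.arange(total) % seq_len          # position within its own sequence
    q_seq, k_seq = seq_id[:, None], seq_id[None, :]
    q_pos, k_pos = pos[:, None], pos[None, :]
    allowed = (k_seq < q_seq) | ((k_seq == q_seq) & (k_pos <= q_pos))
    return ~allowed


class SequenceOfSequencesLayer(nn.Module):
    """One attention block over a flattened family of related sequences."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, seq_len: int = 100):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Positional embeddings restart at each sequence, so the model can order
        # tokens within a sequence but not the sequences themselves.
        self.pos = nn.Embedding(seq_len, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, family: torch.Tensor) -> torch.Tensor:
        # family: (n_seqs, seq_len, d_model) -- embeddings of one protein family
        n, L, d = family.shape
        x = family + self.pos(torch.arange(L))   # within-sequence positions only
        x = x.reshape(1, n * L, d)               # flatten to one long stream
        mask = sequence_of_sequences_mask(n, L)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out).reshape(n, L, d)


if __name__ == "__main__":
    layer = SequenceOfSequencesLayer(seq_len=100)
    family = torch.randn(8, 100, 64)             # 8 related sequences, length 100
    print(layer(family).shape)                   # torch.Size([8, 100, 64])
```

Because the cross-sequence attention carries no sequence-order signal, adding more context sequences only adds an unordered set of tokens to attend over, which is consistent with the abstract's claim that PoET can extrapolate to context lengths beyond those used during training.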
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that for the first time employs an LLM tailored to PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model transfers better to new protein domains and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
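
As a rough picture of the retrieval-augmentation idea (a hedged sketch, not the paper's actual retriever or model): embed the query protein, pull the most similar database sequences by embedding similarity, and concatenate them with the query as unaligned context. The `embed_fn` interface, the separator token, and the toy composition embedding below are illustrative assumptions.

```python
# A hedged sketch of the general retrieval-augmentation idea: fetch similar
# sequences by embedding similarity and feed them to the model alongside the
# query, with no alignment step. `embed_fn` and the toy database stand in for
# a real retriever and sequence corpus.

import numpy as np


def retrieve_context(query: str, database: list[str], embed_fn, k: int = 8) -> list[str]:
    """Return the k database sequences most similar to the query by cosine similarity."""
    q = embed_fn(query)
    db = np.stack([embed_fn(s) for s in database])
    sims = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [database[i] for i in top]


def build_model_input(query: str, context: list[str], sep: str = "<sep>") -> str:
    # Unaligned retrieved sequences are simply concatenated ahead of the query.
    return sep.join(context + [query])


if __name__ == "__main__":
    # Toy embedding: amino-acid composition vector (a real system would use a
    # learned dense retriever or a homology search tool).
    AA = "ACDEFGHIKLMNPQRSTVWY"
    embed = lambda s: np.array([s.count(a) for a in AA], dtype=float)

    db = ["MKTAYIAKQR", "MKLVVLGAGG", "MKTAYIAKQQ", "GSHMSSGLVP"]
    query = "MKTAYIAKQN"
    print(build_model_input(query, retrieve_context(query, db, embed, k=2)))
```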
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Unsupervised language models for disease variant prediction [3.6942566104432886]
We find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot.
We show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
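
One widely used zero-shot recipe of this kind scores a substitution by masking the mutated position and comparing the model's log-probability of the mutant versus the wildtype amino acid at that site. The sketch below illustrates that recipe under a hypothetical `masked_logits` interface; it is not tied to the specific model evaluated in the paper.

```python
# A hedged sketch of a common zero-shot variant-scoring recipe with a masked
# protein LM: mask the mutated position and compare the model's log-probability
# of the mutant vs. the wildtype amino acid there. `masked_logits` is a
# hypothetical interface (sequence with one masked position -> per-amino-acid
# logits at that position), not a specific model's API.

import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def variant_score(wt_seq: str, pos: int, mut_aa: str, masked_logits) -> float:
    """log p(mutant) - log p(wildtype) at `pos` (0-based); higher = more tolerated."""
    wt_aa = wt_seq[pos]
    masked = wt_seq[:pos] + "<mask>" + wt_seq[pos + 1:]
    logits = masked_logits(masked, pos)                 # {amino_acid: logit}
    logz = math.log(sum(math.exp(l) for l in logits.values()))
    log_p = {aa: l - logz for aa, l in logits.items()}
    return log_p[mut_aa] - log_p[wt_aa]


if __name__ == "__main__":
    # Toy stand-in model: uniform logits, so every variant scores 0.0.
    toy = lambda seq, pos: {aa: 0.0 for aa in AMINO_ACIDS}
    print(variant_score("MKTAYIAKQR", 3, "W", toy))
```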
arXiv Detail & Related papers (2022-12-07T22:28:13Z)
- Generative power of a protein language model trained on multiple sequence alignments [0.5639904484784126]
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
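
The iterative procedure can be pictured as Gibbs-style resampling: repeatedly mask a few positions and redraw them from the model's predicted distributions. The sketch below shows that loop for a single sequence with a hypothetical `masked_lm` interface; the paper's method operates on whole MSAs with MSA Transformer, so this is only a simplified illustration.

```python
# A hedged sketch of iterative masked-LM sampling: repeatedly mask a few
# positions and resample them from the model's predictions. `masked_lm` is a
# hypothetical callable (tokens with masks -> per-position probability dicts)
# standing in for a real masked protein language model.

import random

MASK = "<mask>"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def iterative_masked_sampling(masked_lm, seq, n_rounds=100, per_round=3, rng=random):
    seq = list(seq)
    for _ in range(n_rounds):
        positions = rng.sample(range(len(seq)), per_round)
        masked = list(seq)
        for p in positions:
            masked[p] = MASK                   # hide the chosen positions
        probs = masked_lm(masked)              # list of {amino_acid: prob}
        for p in positions:                    # resample each masked site
            aas, ps = zip(*probs[p].items())
            seq[p] = rng.choices(aas, weights=ps, k=1)[0]
    return "".join(seq)


if __name__ == "__main__":
    # Toy stand-in model: uniform distribution at every position.
    toy = lambda toks: [{aa: 1.0 for aa in AMINO_ACIDS} for _ in toks]
    print(iterative_masked_sampling(toy, "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```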
arXiv Detail & Related papers (2022-04-14T16:59:05Z)
- Few Shot Protein Generation [4.7210697296108926]
We present the MSA-to-protein transformer, a generative model of protein sequences conditioned on protein families represented by multiple sequence alignments (MSAs).
Unlike existing approaches to learning generative models of protein families, the MSA-to-protein transformer conditions sequence generation directly on a learned encoding of the multiple sequence alignment.
Our generative approach accurately models epistasis and indels and, unlike other approaches, allows for exact inference and efficient sampling.
arXiv Detail & Related papers (2022-04-03T22:14:02Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
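
A conceptual reading of the pairwise objective (a hedged sketch, not the paper's exact architecture or loss): mask two positions at once and predict the joint identity of the residue pair as a single V x V-way classification, rather than predicting each masked residue independently, which pushes the representation toward inter-residue co-variation.

```python
# A hedged, conceptual sketch of the pairwise-masking idea: mask two residues
# at once and predict the *joint* identity of the pair instead of two
# independent residues. This illustrates the objective only; the model, head,
# and loss here are illustrative assumptions.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

V = 21           # 20 amino acids + a mask token
MASK_ID = 20
D = 64


class ToyPairwiseMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        # Pair head: scores all joint assignments for a pair of masked positions.
        self.pair_head = nn.Linear(2 * D, (V - 1) * (V - 1))

    def pair_logits(self, tokens, i, j):
        h = self.encoder(self.embed(tokens))             # (B, L, D)
        pair = torch.cat([h[:, i], h[:, j]], dim=-1)     # (B, 2D)
        return self.pair_head(pair)                      # (B, (V-1)*(V-1))


def pairwise_masked_loss(model, tokens):
    """Mask a random pair of positions and compute the joint cross-entropy."""
    B, L = tokens.shape
    i, j = random.sample(range(L), 2)
    target = tokens[:, i] * (V - 1) + tokens[:, j]       # joint pair index
    corrupted = tokens.clone()
    corrupted[:, i] = MASK_ID
    corrupted[:, j] = MASK_ID
    return F.cross_entropy(model.pair_logits(corrupted, i, j), target)


if __name__ == "__main__":
    model = ToyPairwiseMLM()
    batch = torch.randint(0, V - 1, (4, 50))             # 4 toy sequences of length 50
    print(pairwise_masked_loss(model, batch).item())
```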
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that the Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.