Generative power of a protein language model trained on multiple
sequence alignments
- URL: http://arxiv.org/abs/2204.07110v1
- Date: Thu, 14 Apr 2022 16:59:05 GMT
- Title: Generative power of a protein language model trained on multiple
sequence alignments
- Authors: Damiano Sgarbossa, Umberto Lupo and Anne-Florence Bitbol
- Abstract summary: Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates for generating novel sequences within protein families.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
- Score: 0.5639904484784126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational models starting from large ensembles of evolutionarily related
protein sequences capture a representation of protein families and learn
constraints associated with protein structure and function. They thus open the
possibility for generating novel sequences belonging to protein families.
Protein language models trained on multiple sequence alignments, such as MSA
Transformer, are highly attractive candidates to this end. We propose and test
an iterative method that directly uses the masked language modeling objective
to generate sequences using MSA Transformer. We demonstrate that the resulting
sequences generally score better than those generated by Potts models, and even
than natural sequences, for homology, coevolution and structure-based measures.
Moreover, MSA Transformer better reproduces the higher-order statistics and the
distribution of sequences in sequence space of natural data than Potts models,
although Potts models better reproduce first- and second-order statistics. MSA
Transformer is thus a strong candidate for protein sequence generation and
protein design.
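For intuition, here is a minimal sketch of the iterative masked-language-modeling generation procedure described in the abstract, assuming the fair-esm package and its pre-trained MSA Transformer (esm_msa1b_t12_100M_UR50S); the masking fraction, number of iterations, and greedy decoding are illustrative assumptions, not the exact settings used in the paper.

```python
# Minimal sketch: iterative masked-LM generation with MSA Transformer (fair-esm).
# Masking fraction, iteration count, and greedy decoding are illustrative
# assumptions, not the paper's exact procedure or hyperparameters.
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def iterative_generation(msa, n_iter=10, mask_frac=0.1):
    """msa: one MSA given as a list of (label, aligned_sequence) tuples."""
    _, _, tokens = batch_converter([msa])            # (1, num_seqs, seq_len + 1)
    seq_len = tokens.shape[-1]
    n_mask = max(1, int(mask_frac * (seq_len - 1)))
    for _ in range(n_iter):
        # Randomly mask a fraction of columns (index 0 is the prepended start token).
        cols = 1 + torch.randperm(seq_len - 1)[:n_mask]
        masked = tokens.clone()
        masked[0, :, cols] = alphabet.mask_idx
        with torch.no_grad():
            logits = model(masked)["logits"]         # (1, num_seqs, seq_len + 1, vocab)
        # Refill the masked positions with the highest-scoring tokens (greedy).
        tokens[0, :, cols] = logits[0, :, cols, :].argmax(dim=-1)
    # Convert token indices back to amino-acid strings, dropping the start token.
    return ["".join(alphabet.get_tok(i) for i in row[1:]) for row in tokens[0].tolist()]
```

In this sketch all sequences of the input MSA are iteratively rewritten; sampling from the softmax instead of taking the argmax, or masking only a subset of rows, are natural variants of the same masked-language-modeling objective.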
Related papers
- Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation.
We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works.
We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- FoldToken: Learning Protein Language via Vector Quantization and Beyond [56.19308144551836]
We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols.
We refer to the learned symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language.
arXiv Detail & Related papers (2024-02-04T12:18:51Z)
- Pairing interacting protein sequences using masked language modeling [0.3222802562733787]
We develop a method to pair interacting protein sequences using protein language models trained on sequence alignments.
We exploit the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context.
We show that MSA Transformer captures inter-chain coevolution even though it was trained only on single-chain data, meaning it can be used out of distribution.
arXiv Detail & Related papers (2023-08-14T13:42:09Z)
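As a rough illustration of the masked-filling idea behind this pairing approach, the sketch below scores one candidate pairing by masking the second chain in a concatenated alignment and measuring how well MSA Transformer recovers it; the plain concatenation and the mean log-likelihood score are assumptions made for illustration, not the exact procedure of the paper.

```python
# Rough sketch: score a candidate chain pairing by masking chain B in a
# concatenated MSA and checking how well MSA Transformer recovers it.
# The concatenation scheme and the scoring choice are illustrative assumptions.
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pairing_score(paired_msa, len_chain_a):
    """paired_msa: list of (label, chain_A + chain_B) aligned sequences.
    Returns the mean log-probability of the true chain-B residues once masked."""
    _, _, tokens = batch_converter([paired_msa])     # (1, num_seqs, L + 1)
    target = tokens.clone()
    # Mask every chain-B column (offset by 1 for the prepended start token).
    b_cols = torch.arange(1 + len_chain_a, tokens.shape[-1])
    tokens[0, :, b_cols] = alphabet.mask_idx
    with torch.no_grad():
        logits = model(tokens)["logits"]             # (1, num_seqs, L + 1, vocab)
    logprobs = torch.log_softmax(logits[0, :, b_cols, :], dim=-1)
    true_b = target[0, :, b_cols]
    return logprobs.gather(-1, true_b.unsqueeze(-1)).mean().item()
```

A higher score for one candidate partner than for another would then favor that pairing; the actual method goes beyond this single-shot score, but the masked-prediction signal it exploits is the same.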
- PoET: A generative model of protein families as sequences-of-sequences [5.05828899601167]
We propose a generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences.
PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest.
We show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all depths.
arXiv Detail & Related papers (2023-06-09T16:06:36Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
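For context, the Potts model used for aptamer design here, and as the baseline in the abstract above, is the standard pairwise maximum-entropy model over aligned sequences; in conventional notation it reads:

```latex
% Potts (pairwise maximum-entropy) model over an aligned sequence s = (s_1, ..., s_L):
% h_i are site fields and J_{ij} are pairwise couplings, fitted so that the model
% reproduces the one- and two-site frequencies of the training alignment.
P(s_1, \dots, s_L) = \frac{1}{Z}
  \exp\!\left( \sum_{i=1}^{L} h_i(s_i) + \sum_{1 \le i < j \le L} J_{ij}(s_i, s_j) \right),
\qquad
Z = \sum_{s'} \exp\!\left( \sum_{i} h_i(s'_i) + \sum_{i<j} J_{ij}(s'_i, s'_j) \right).
```

New sequences are then sampled from P, typically by Markov chain Monte Carlo; because the model is fitted to one- and two-site frequencies, it reproduces first- and second-order statistics by construction, which is consistent with the comparison drawn in the abstract above.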
- Few Shot Protein Generation [4.7210697296108926]
We present the MSA-to-protein transformer, a generative model of protein sequences conditioned on protein families represented by multiple sequence alignments (MSAs).
Unlike existing approaches to learning generative models of protein families, the MSA-to-protein transformer conditions sequence generation directly on a learned encoding of the multiple sequence alignment.
Unlike other approaches, our generative approach accurately models epistasis and indels, and allows for exact inference and efficient sampling.
arXiv Detail & Related papers (2022-04-03T22:14:02Z)
- Protein language models trained on multiple sequence alignments learn phylogenetic relationships [0.5639904484784126]
Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction.
We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs.
arXiv Detail & Related papers (2022-03-29T12:07:45Z)
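To make the compared quantity concrete, the Hamming distances in question are simply the fractions of differing columns between pairs of aligned sequences, which can be computed directly from an MSA as in the short NumPy sketch below (normalizing by alignment length is a choice made here for illustration).

```python
# Pairwise (normalized) Hamming distances between the aligned sequences of an MSA,
# the quantity the column attentions are reported to correlate with.
import numpy as np

def hamming_matrix(msa):
    """msa: list of equal-length aligned sequences (gaps included)."""
    arr = np.array([list(seq) for seq in msa])   # (num_seqs, L) array of characters
    diff = arr[:, None, :] != arr[None, :, :]    # (num_seqs, num_seqs, L) mismatches
    return diff.mean(axis=-1)                    # fraction of differing columns

# Toy example with three aligned sequences:
print(hamming_matrix(["MK-LV", "MKALV", "MR-IV"]))
```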
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys compared with traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
- Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins [0.0]
Predicting the effect of mutations in proteins is one of the most critical challenges in protein engineering.
We use clustering, embedding, and dimensionality reduction techniques to select combinations of physicochemical properties for the encoding stage.
We then select the best-performing predictive models for each set of properties and combine them into an assembled model.
arXiv Detail & Related papers (2020-10-07T16:35:02Z)
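As a loose illustration of the kind of pipeline sketched in this last entry, the example below encodes equal-length protein variants with a physicochemical scale, summarizes each encoded signal with a Fourier transform, and averages several fitted regressors into a simple assembled model; the hydrophobicity scale, the FFT features, and the specific regressors are assumptions for illustration, not the authors' exact pipeline.

```python
# Loose sketch of a property-encoding + assembled-model pipeline:
# sequences -> physicochemical signal -> FFT magnitudes -> averaged regressors.
# Scale, features, and regressors are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Kyte-Doolittle hydrophobicity, one of many possible physicochemical scales.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def encode(seq, n_coeffs=16):
    """Encode a sequence as the magnitudes of its first FFT coefficients."""
    signal = np.array([KD[aa] for aa in seq])
    spectrum = np.abs(np.fft.rfft(signal))
    return spectrum[:min(n_coeffs, spectrum.size)]

def fit_assembled_model(train_seqs, train_y):
    """train_seqs: equal-length variant sequences; train_y: measured property values."""
    X = np.array([encode(s) for s in train_seqs])
    models = [Ridge(), SVR(), RandomForestRegressor(n_estimators=200, random_state=0)]
    for m in models:
        m.fit(X, train_y)
    def predict(seqs):
        Z = np.array([encode(s) for s in seqs])
        # Assembled prediction: a simple average over the fitted regressors.
        return np.mean([m.predict(Z) for m in models], axis=0)
    return predict
```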
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.