Pairing interacting protein sequences using masked language modeling
- URL: http://arxiv.org/abs/2308.07136v1
- Date: Mon, 14 Aug 2023 13:42:09 GMT
- Title: Pairing interacting protein sequences using masked language modeling
- Authors: Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol
- Abstract summary: We develop a method to pair interacting protein sequences using protein language models trained on sequence alignments.
We exploit the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context.
We show that it captures inter-chain coevolution even though it was trained on single-chain data, meaning it can be used out of distribution.
- Score: 0.3222802562733787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting which proteins interact together from amino-acid sequences is an
important task. We develop a method to pair interacting protein sequences which
leverages the power of protein language models trained on multiple sequence
alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We
formulate the problem of pairing interacting partners among the paralogs of two
protein families in a differentiable way. We introduce a method called DiffPALM
that solves it by exploiting the ability of MSA Transformer to fill in masked
amino acids in multiple sequence alignments using the surrounding context. MSA
Transformer encodes coevolution between functionally or structurally coupled
amino acids. We show that it captures inter-chain coevolution even though it was
trained on single-chain data, meaning it can be used out of distribution.
Relying on MSA Transformer without fine-tuning, DiffPALM
outperforms existing coevolution-based pairing methods on difficult benchmarks
of shallow multiple sequence alignments extracted from ubiquitous prokaryotic
protein datasets. It also outperforms an alternative method based on a
state-of-the-art protein language model trained on single sequences. Paired
alignments of interacting protein sequences are a crucial ingredient of
supervised deep learning methods to predict the three-dimensional structure of
protein complexes. DiffPALM substantially improves the structure prediction of
some eukaryotic protein complexes by AlphaFold-Multimer, without significantly
deteriorating any of those we tested. It also achieves performance competitive
with orthology-based pairing.
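To make the differentiable-pairing formulation in the abstract concrete, below is a minimal, hypothetical PyTorch sketch. It pairs the rows of two paralog MSAs by optimizing a soft permutation (via Sinkhorn normalization) against a masked-language-model loss; the ToyMSAMLM class is only a stand-in for MSA Transformer, and the constants, names, and uniform "soft mask" are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of differentiable paralog pairing in the spirit of DiffPALM.
# ToyMSAMLM stands in for MSA Transformer; all names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 21          # toy alphabet: 20 amino acids + gap (assumption)
MASK_FRAC = 0.15    # fraction of positions masked per step (assumption)

def sinkhorn(log_scores: torch.Tensor, n_iter: int = 20) -> torch.Tensor:
    """Project a square score matrix onto an (approximately) doubly stochastic matrix."""
    log_p = log_scores
    for _ in range(n_iter):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row-normalise
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column-normalise
    return log_p.exp()

class ToyMSAMLM(nn.Module):
    """Tiny stand-in for an MSA language model: per-position MLP over soft one-hots."""
    def __init__(self, d=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, d), nn.ReLU(), nn.Linear(d, VOCAB))
    def forward(self, x):            # x: (n_rows, length, VOCAB) soft one-hot MSA
        return self.net(x)           # logits: (n_rows, length, VOCAB)

def pairing_mlm_loss(msa_a, msa_b, log_scores, model):
    """MLM loss of the paired MSA, differentiable w.r.t. the pairing scores."""
    P = sinkhorn(log_scores)                          # soft permutation over rows of B
    a = F.one_hot(msa_a, VOCAB).float()               # (n, La, VOCAB)
    b = F.one_hot(msa_b, VOCAB).float()
    b_paired = torch.einsum("ij,jlv->ilv", P, b)      # softly reorder B's rows
    paired = torch.cat([a, b_paired], dim=1)          # concatenate chains along length
    mask = torch.rand(paired.shape[:2]) < MASK_FRAC   # positions to mask
    masked_in = paired.clone()
    masked_in[mask] = 1.0 / VOCAB                     # "mask" = uniform soft token (assumption)
    logits = model(masked_in)
    # cross-entropy against the unmasked (soft) targets, at masked positions only
    return -(paired[mask] * F.log_softmax(logits[mask], dim=-1)).sum(-1).mean()

# Toy usage: optimise the pairing scores by gradient descent on the MLM loss.
n, La, Lb = 8, 30, 25
msa_a = torch.randint(VOCAB, (n, La))
msa_b = torch.randint(VOCAB, (n, Lb))
log_scores = torch.zeros(n, n, requires_grad=True)
model = ToyMSAMLM()
opt = torch.optim.Adam([log_scores], lr=0.1)
for step in range(50):
    opt.zero_grad()
    loss = pairing_mlm_loss(msa_a, msa_b, log_scores, model)
    loss.backward()
    opt.step()
hard_pairing = sinkhorn(log_scores.detach()).argmax(dim=1)  # read out a discrete pairing
```

In the actual method, the MLM loss would come from the pretrained MSA Transformer rather than a toy model, and the readout of a hard one-to-one pairing from the soft permutation is likewise simplified here.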
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Structure Language Models for Protein Conformation Generation [66.42864253026053]
Traditional physics-based simulation methods often struggle with sampling equilibrium conformations.
Deep generative models have shown promise in generating protein conformations as a more efficient alternative.
We introduce Structure Language Modeling as a novel framework for efficient protein conformation generation.
arXiv Detail & Related papers (2024-10-24T03:38:51Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction [63.50967073653953]
Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods use only a single modality, either protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
arXiv Detail & Related papers (2024-02-13T03:51:10Z)
- Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers [18.498779242323582]
We propose a novel approach, Prot2Text, which predicts a protein's function in a free text style.
By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types.
arXiv Detail & Related papers (2023-07-25T09:35:43Z)
- Generative power of a protein language model trained on multiple sequence alignments [0.5639904484784126]
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates for this purpose.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
arXiv Detail & Related papers (2022-04-14T16:59:05Z)
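As a rough illustration of the iterative masked-sampling procedure summarized in the entry above, here is a hedged sketch: random positions of an MSA are repeatedly masked and resampled from a language model's predicted distributions. predict_logits is a placeholder (returning uniform logits) for MSA Transformer's masked-token head, and the masking fraction, temperature, and iteration count are assumptions.

```python
# Hypothetical sketch of iterative masked-language-model generation on an MSA.
# predict_logits is a stand-in for MSA Transformer; parameters are assumptions.
import torch

VOCAB = 21            # toy alphabet: 20 amino acids + gap
MASK_FRAC = 0.1       # fraction of positions resampled per iteration (assumption)

def predict_logits(msa: torch.Tensor) -> torch.Tensor:
    """Placeholder for a masked language model; returns uniform logits here."""
    return torch.zeros(*msa.shape, VOCAB)

def iterative_generate(msa: torch.Tensor, n_iter: int = 100, temperature: float = 1.0):
    """Iteratively refresh random positions of the MSA from the model's predictions."""
    msa = msa.clone()
    for _ in range(n_iter):
        mask = torch.rand(msa.shape) < MASK_FRAC             # positions to resample
        if not mask.any():
            continue
        # with a real model, masked positions would carry a <mask> token before this call
        logits = predict_logits(msa) / temperature
        probs = torch.softmax(logits[mask], dim=-1)          # (n_masked, VOCAB)
        msa[mask] = torch.multinomial(probs, 1).squeeze(-1)  # sample new residues
    return msa

# Toy usage on a random "MSA" of 16 sequences of length 50.
msa = torch.randint(VOCAB, (16, 50))
generated = iterative_generate(msa)
```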
phylogenetic relationships [0.5639904484784126]
Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction.
We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs.
arXiv Detail & Related papers (2022-03-29T12:07:45Z)
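The correlation claim in the entry above can be checked with a few lines of code: compute pairwise Hamming distances between MSA rows and compare them with an averaged column-attention map. The attention tensor below is random, and its shape and the simple averaging are assumptions; with MSA Transformer it would be the model's column self-attentions.

```python
# Minimal sketch: pairwise Hamming distances between MSA rows versus an averaged
# column-attention map. The attention tensor is random here (an assumption).
import torch

def hamming_matrix(msa: torch.Tensor) -> torch.Tensor:
    """Fraction of differing positions for every pair of sequences in the MSA."""
    return (msa[:, None, :] != msa[None, :, :]).float().mean(-1)   # (n, n)

n_seqs, length, n_layers, n_heads = 16, 50, 4, 8
msa = torch.randint(21, (n_seqs, length))
# stand-in for column attentions: (layers, heads, columns, n_seqs, n_seqs)
col_attn = torch.rand(n_layers, n_heads, length, n_seqs, n_seqs)
attn_avg = col_attn.mean(dim=(0, 1, 2))            # simple average -> (n_seqs, n_seqs)

d = hamming_matrix(msa)
# Pearson correlation between the off-diagonal entries of the two matrices
off_diag = ~torch.eye(n_seqs, dtype=torch.bool)
corr = torch.corrcoef(torch.stack([attn_avg[off_diag], d[off_diag]]))[0, 1]
print(f"correlation between averaged column attention and Hamming distance: {corr.item():.3f}")
```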
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
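To illustrate the pairwise masking objective described in the entry above, here is a hypothetical sketch: two positions are masked and a head predicts the joint identity of the residue pair as a distribution over all amino-acid pairs (VOCAB * VOCAB classes), rather than two independent tokens. The tiny Transformer encoder and every hyperparameter are illustrative assumptions, not the PMLM architecture.

```python
# Hypothetical sketch of a pairwise masked language model: predict the joint identity
# of two masked residues. The architecture and sizes are assumptions, not PMLM itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D = 21, 64

class ToyPairwiseMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, D)                  # +1 for a <mask> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.pair_head = nn.Linear(2 * D, VOCAB * VOCAB)         # joint pair logits

    def forward(self, tokens, i, j):
        h = self.encoder(self.embed(tokens))                     # (B, L, D)
        rows = torch.arange(len(tokens))
        hij = torch.cat([h[rows, i], h[rows, j]], dim=-1)        # features at both positions
        return self.pair_head(hij)                               # (B, VOCAB*VOCAB)

# One toy training step on random sequences.
B, L = 8, 40
seqs = torch.randint(VOCAB, (B, L))
i = torch.randint(L, (B,))
j = (i + 1 + torch.randint(L - 1, (B,))) % L                     # a second, distinct position
masked = seqs.clone()
masked[torch.arange(B), i] = VOCAB                               # <mask> token id
masked[torch.arange(B), j] = VOCAB
model = ToyPairwiseMLM()
logits = model(masked, i, j)
target = seqs[torch.arange(B), i] * VOCAB + seqs[torch.arange(B), j]  # joint pair label
loss = F.cross_entropy(logits, target)
loss.backward()
```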
- PANDA: Predicting the change in proteins binding affinity upon mutations using sequence information [0.3867363075280544]
Determination of change in binding affinity upon mutations requires sophisticated, expensive, and time-consuming wet-lab experiments.
Most computational prediction techniques require protein structures, which limits their applicability to protein complexes with known structures.
We use protein sequence information instead of protein structures, together with machine learning techniques, to accurately predict the change in protein binding affinity upon mutation.
arXiv Detail & Related papers (2020-09-16T17:12:25Z)
- Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures [18.961218808251076]
We propose two new learning operations enabling deep 3D analysis of large-scale protein data.
First, we introduce a novel convolution operator that considers both the intrinsic (invariant under protein folding) and the extrinsic (invariant under bonding) structure.
Second, we enable a multi-scale protein analysis by introducing hierarchical pooling operators, exploiting the fact that proteins are a recombination of a finite set of amino acids.
arXiv Detail & Related papers (2020-07-13T09:02:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.