Pairing interacting protein sequences using masked language modeling
- URL: http://arxiv.org/abs/2308.07136v1
- Date: Mon, 14 Aug 2023 13:42:09 GMT
- Title: Pairing interacting protein sequences using masked language modeling
- Authors: Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol
- Abstract summary: We develop a method to pair interacting protein sequences using protein language models trained on sequence alignments.
We exploit the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context.
We show that it captures inter-chain coevolution even though it was trained on single-chain data, meaning that it can be applied out-of-distribution.
- Score: 0.3222802562733787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting which proteins interact together from amino-acid sequences is an
important task. We develop a method to pair interacting protein sequences which
leverages the power of protein language models trained on multiple sequence
alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We
formulate the problem of pairing interacting partners among the paralogs of two
protein families in a differentiable way. We introduce a method called DiffPALM
that solves it by exploiting the ability of MSA Transformer to fill in masked
amino acids in multiple sequence alignments using the surrounding context. MSA
Transformer encodes coevolution between functionally or structurally coupled
amino acids. We show that it captures inter-chain coevolution even though it
was trained on single-chain data, meaning that it can be applied
out-of-distribution. Relying on MSA Transformer without fine-tuning, DiffPALM
outperforms existing coevolution-based pairing methods on difficult benchmarks
of shallow multiple sequence alignments extracted from ubiquitous prokaryotic
protein datasets. It also outperforms an alternative method based on a
state-of-the-art protein language model trained on single sequences. Paired
alignments of interacting protein sequences are a crucial ingredient of
supervised deep learning methods to predict the three-dimensional structure of
protein complexes. DiffPALM substantially improves the structure prediction of
some eukaryotic protein complexes by AlphaFold-Multimer, without significantly
deteriorating any of those we tested. It also achieves performance competitive
with orthology-based pairing.
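The pairing task described in the abstract is at heart an assignment problem over the paralogs of two families. Below is a minimal toy Python sketch of that idea; `masked_recovery_score` is a hypothetical stand-in for the MSA Transformer masked-language-modeling signal that DiffPALM actually uses, and the brute-force search stands in for the differentiable relaxation the paper introduces. All names and the scoring are illustrative assumptions, not the authors' implementation.

```python
from itertools import permutations

def masked_recovery_score(pair, toy_coupling):
    # Stand-in for the masked-LM signal: in DiffPALM this would measure
    # how confidently MSA Transformer reconstructs masked amino acids
    # given the candidate paired-MSA context. Here it is just a lookup.
    return toy_coupling.get(pair, 0.0)

def best_pairing(paralogs_a, paralogs_b, toy_coupling):
    """Brute-force the one-to-one pairing of paralogs that maximizes the
    total recovery score. DiffPALM avoids this factorial search by
    relaxing the permutation into a differentiable object instead."""
    best_perm, best_score = None, float("-inf")
    for perm in permutations(paralogs_b):
        score = sum(masked_recovery_score((a, b), toy_coupling)
                    for a, b in zip(paralogs_a, perm))
        if score > best_score:
            best_perm, best_score = list(zip(paralogs_a, perm)), score
    return best_perm, best_score
```

With a toy coupling table that rewards the "true" interaction partners, the search recovers that pairing; the real method replaces both the table and the search with model-derived, differentiable quantities.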
Related papers
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction [63.50967073653953]
Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods utilize only a single modality, either protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
arXiv Detail & Related papers (2024-02-13T03:51:10Z)
- Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers [18.498779242323582]
We propose a novel approach, Prot2Text, which predicts a protein's function in a free text style.
By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types.
arXiv Detail & Related papers (2023-07-25T09:35:43Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
- Generative power of a protein language model trained on multiple sequence alignments [0.5639904484784126]
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families.
Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates for this purpose.
We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences using MSA Transformer.
arXiv Detail & Related papers (2022-04-14T16:59:05Z)
- Protein language models trained on multiple sequence alignments learn phylogenetic relationships [0.5639904484784126]
Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction.
We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs.
arXiv Detail & Related papers (2022-03-29T12:07:45Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- PANDA: Predicting the change in proteins binding affinity upon mutations using sequence information [0.3867363075280544]
Determination of change in binding affinity upon mutations requires sophisticated, expensive, and time-consuming wet-lab experiments.
Most of the computational prediction techniques require protein structures that limit their applicability to protein complexes with known structures.
We have used protein sequence information instead of protein structures along with machine learning techniques to accurately predict the change in protein binding affinity upon mutation.
arXiv Detail & Related papers (2020-09-16T17:12:25Z)
- Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures [18.961218808251076]
We propose two new learning operations enabling deep 3D analysis of large-scale protein data.
First, we introduce a novel convolution operator which considers both the intrinsic (invariant under protein folding) and the extrinsic (invariant under bonding) structure.
Second, we enable a multi-scale protein analysis by introducing hierarchical pooling operators, exploiting the fact that proteins are a recombination of a finite set of amino acids.
arXiv Detail & Related papers (2020-07-13T09:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.