Pre-training Protein Language Models with Label-Agnostic Binding Pairs
Enhances Performance in Downstream Tasks
- URL: http://arxiv.org/abs/2012.03084v1
- Date: Sat, 5 Dec 2020 17:37:41 GMT
- Title: Pre-training Protein Language Models with Label-Agnostic Binding Pairs
Enhances Performance in Downstream Tasks
- Authors: Modestas Filipavicius, Matteo Manica, Joris Cadow, Maria Rodriguez
Martinez
- Abstract summary: Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model that is pre-trained on a mixture of binding and non-binding protein sequence pairs.
We suggest that the Transformer's attention mechanism contributes to protein binding site discovery.
- Score: 1.452875650827562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Less than 1% of protein sequences are structurally and functionally
annotated. The Natural Language Processing (NLP) community has recently embraced
self-supervised learning as a powerful approach to learn representations from
unlabeled text, in large part due to the attention-based context-aware
Transformer models. In this work we present a modification to the RoBERTa model:
during pre-training, its input is a mixture of binding and non-binding protein
sequence pairs (from the STRING database). Crucially, the sequence pairs carry no
label indicating their binding status; the model relies solely on the Masked
Language Modeling (MLM) objective during pre-training. After fine-tuning, such an
approach surpasses models trained on single protein sequences for protein-protein
binding prediction, TCR-epitope binding prediction, cellular-localization and
remote homology classification tasks. We suggest that the Transformer's
attention mechanism contributes to protein binding site discovery. Furthermore,
we compress protein sequences by 64% with a Byte Pair Encoding (BPE)
vocabulary consisting of 10K subwords, each around 3-4 amino acids long.
Finally, to expand the model input space to even larger proteins and
multi-protein assemblies, we pre-train Longformer models that support 2,048
tokens. Further work in token-level classification for secondary structure
prediction is needed. Code available at:
https://github.com/PaccMann/paccmann_proteomics
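To make the input side of the pre-training recipe above concrete, the following is a minimal sketch, not the authors' exact pipeline (which lives in the paccmann_proteomics repository linked above): it trains a roughly 10K-subword Byte Pair Encoding vocabulary over raw amino-acid strings with the Hugging Face `tokenizers` library and packs two proteins into a single, label-free example. The corpus file name, special tokens, and toy sequences are illustrative assumptions.

```python
# Sketch: learn a ~10K-subword BPE vocabulary over protein sequences and pack a
# sequence pair into one label-agnostic example. File names, special tokens and
# sequences are assumptions, not the paper's exact configuration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# Protein sequences contain no whitespace, so each input line is one "word"
# that BPE then segments into subwords of a few residues each.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=10_000,  # the abstract's 10K-subword vocabulary
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
# Hypothetical corpus: one amino-acid sequence per line (e.g. exported from STRING).
tokenizer.train(files=["string_sequences.txt"], trainer=trainer)

# RoBERTa-style pair packing: <s> seq_A </s> </s> seq_B </s>, with no binding label.
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[
        ("<s>", tokenizer.token_to_id("<s>")),
        ("</s>", tokenizer.token_to_id("</s>")),
    ],
)
tokenizer.save("bpe_10k.json")

seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequences, not real binding partners
seq_b = "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLL"
encoded = tokenizer.encode(seq_a, seq_b)
print(len(encoded.ids), "subword tokens for", len(seq_a) + len(seq_b), "residues")
```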
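A companion sketch, under the same assumptions, of the label-agnostic objective itself: tokenized pairs feed a randomly initialized RoBERTa encoder, and only masked-token prediction supervises pre-training, so the binding/non-binding status never enters the loss. Hyperparameters, the toy dataset, and output paths are placeholders; the abstract's 2,048-token variant would swap in a Longformer configuration and model.

```python
# Sketch: MLM pre-training on unlabeled protein pairs with a RoBERTa encoder
# (Hugging Face transformers). All hyperparameters here are placeholders.
from torch.utils.data import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

MAX_LEN = 512  # a Longformer config/model would raise this to 2,048

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="bpe_10k.json",  # BPE vocabulary saved in the previous sketch
    bos_token="<s>", eos_token="</s>", pad_token="<pad>",
    unk_token="<unk>", mask_token="<mask>",
)

class ProteinPairDataset(Dataset):
    """Tokenized (seq_a, seq_b) pairs; their binding status is never used."""
    def __init__(self, pairs):
        self.examples = [
            tokenizer(a, b, truncation=True, max_length=MAX_LEN) for a, b in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        return self.examples[idx]

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_position_embeddings=MAX_LEN + 4,  # RoBERTa offsets positions past the pad id
)
model = RobertaForMaskedLM(config)

# Standard MLM collator: 15% of tokens are masked; no binding label appears anywhere.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

toy_pairs = [("MKTAYIAKQRQISFVKSHFSRQ", "MNIFEMLRIDEGLRLKIYKDTE")]  # placeholder pairs
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_pairs", per_device_train_batch_size=8, num_train_epochs=1),
    data_collator=collator,
    train_dataset=ProteinPairDataset(toy_pairs),
)
trainer.train()
```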
Related papers
- PLA-SGCN: Protein-Ligand Binding Affinity Prediction by Integrating Similar Pairs and Semi-supervised Graph Convolutional Network [6.024776891570197]
This paper integrates retrieved hard protein-ligand pairs into the PLA prediction (task prediction) step using a semi-supervised graph convolutional network (GCN).
The results show that the proposed method performs significantly better than comparable approaches.
arXiv Detail & Related papers (2024-05-13T03:27:02Z) - ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z) - FoldToken: Learning Protein Language via Vector Quantization and Beyond [56.19308144551836]
We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols.
We refer to the learned symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language.
arXiv Detail & Related papers (2024-02-04T12:18:51Z) - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering
the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z) - PoET: A generative model of protein families as sequences-of-sequences [5.05828899601167]
We propose a generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences.
PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest.
We show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all depths.
arXiv Detail & Related papers (2023-06-09T16:06:36Z) - Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z) - Reprogramming Pretrained Language Models for Protein Sequence
Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z) - HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein
Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is how to capture the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)