Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model
- URL: http://arxiv.org/abs/2110.15527v1
- Date: Fri, 29 Oct 2021 04:01:32 GMT
- Title: Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model
- Authors: Liang He, Shizhuo Zhang, Lijun Wu, Huanhuan Xia, Fusong Ju, He Zhang,
Siyuan Liu, Yingce Xia, Jianwei Zhu, Pan Deng, Bin Shao, Tao Qin, Tie-Yan Liu
- Abstract summary: The key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
- Score: 93.9943278892735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding protein sequences is vital and urgent for biology, healthcare,
and medicine. Labeling approaches are expensive and time-consuming, while the
amount of unlabeled data is growing much faster than that of labeled data
thanks to low-cost, high-throughput sequencing methods. In order to extract
knowledge from these unlabeled data, representation learning is of significant
value for protein-related tasks and has great potential for helping us learn
more about protein functions and structures. The key problem in protein
sequence representation learning is to capture the co-evolutionary information
reflected by inter-residue co-variation in the sequences. Instead of
leveraging multiple sequence alignments (MSAs) as is usually done, we propose a novel
method to capture this information directly by pre-training via a dedicated
language model, i.e., Pairwise Masked Language Model (PMLM). In a conventional
masked language model, the masked tokens are modeled by conditioning on the
unmasked tokens only and are predicted independently of one another. However, our
proposed PMLM takes the dependency among masked tokens into consideration,
i.e., the probability of a token pair is not equal to the product of the
probabilities of the two tokens (a minimal sketch follows the abstract). By applying this model, the pre-trained encoder
is able to generate a better representation for protein sequences. Our results
show that the proposed method can effectively capture the inter-residue
correlations and improves contact prediction performance by up to 9%
compared to the MLM baseline under the same setting. The proposed model also
significantly outperforms the MSA baseline by more than 7% on the TAPE contact
prediction benchmark when pre-trained on a subset of the sequence database
from which the MSAs are generated, revealing the potential of sequence
pre-training methods to surpass MSA-based methods in general.
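To make the pairwise objective concrete, below is a minimal PyTorch sketch of one way a pairwise prediction head could sit on top of a masked-sequence encoder. It is not the authors' implementation: the vocabulary size, hidden width, head architecture, pairing of masked positions, and equal loss weighting are assumptions made for illustration. The point is that the pairwise head scores the joint label (x_i, x_j) directly, so its distribution need not factorize into the two per-token distributions produced by a conventional MLM head.
```python
# Minimal sketch of a pairwise masked-language-model (PMLM) head.
# NOT the authors' code: vocabulary size, hidden width, pairing strategy,
# and loss weighting are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 25    # assumed: 20 amino acids + a few special tokens
HIDDEN = 128  # assumed encoder width

class PairwiseMLMHead(nn.Module):
    """Scores the joint label (x_i, x_j) for a pair of masked positions,
    so p(x_i, x_j | context) need not equal p(x_i | context) * p(x_j | context)."""
    def __init__(self, hidden=HIDDEN, vocab=VOCAB):
        super().__init__()
        self.token_head = nn.Linear(hidden, vocab)      # conventional MLM head
        self.pair_head = nn.Sequential(                 # pairwise head
            nn.Linear(2 * hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, vocab * vocab),           # logits over token pairs
        )
        self.vocab = vocab

    def forward(self, h, pos_i, pos_j, target_i, target_j):
        # h: (batch, seq_len, hidden) encoder output with positions i and j masked.
        b = torch.arange(h.size(0), device=h.device)
        h_i, h_j = h[b, pos_i], h[b, pos_j]

        # Independent (conventional MLM) loss for each masked token.
        mlm_loss = (F.cross_entropy(self.token_head(h_i), target_i)
                    + F.cross_entropy(self.token_head(h_j), target_j))

        # Joint (pairwise) loss: the label indexes the pair (x_i, x_j).
        pair_logits = self.pair_head(torch.cat([h_i, h_j], dim=-1))
        pair_target = target_i * self.vocab + target_j
        pair_loss = F.cross_entropy(pair_logits, pair_target)

        return mlm_loss + pair_loss

# Toy usage with random "encoder" states for 4 sequences of length 64.
head = PairwiseMLMHead()
h = torch.randn(4, 64, HIDDEN)
pos_i = torch.tensor([3, 10, 7, 20])
pos_j = torch.tensor([15, 40, 30, 55])
target_i = torch.randint(0, VOCAB, (4,))
target_j = torch.randint(0, VOCAB, (4,))
loss = head(h, pos_i, pos_j, target_i, target_j)
loss.backward()
```
Because the amino-acid vocabulary is small, an explicit vocab-squared output layer stays tractable here; for larger vocabularies a factorized or bilinear scoring of the two hidden states would be the natural alternative.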
Related papers
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Toward Understanding BERT-Like Pre-Training for DNA Foundation Models [78.48760388079523]
Existing pre-training methods for DNA sequences rely on the direct adoption of BERT pre-training from NLP.
We introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pre-training by continuously expanding its mask boundary.
RandomMask achieves 68.16% in Matthews correlation coefficient for Epigenetic Mark Prediction, an increase of 19.85% over the baseline.
arXiv Detail & Related papers (2023-10-11T16:40:57Z)
- PEvoLM: Protein Sequence Evolutionary Information Language Model [0.0]
A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs).
This research presents an Embeddings from Language Models (ELMo) approach, converting a protein sequence to a numerical vector representation.
The model was trained not only on predicting the next AA but also on the probability distribution of the next AA derived from similar, yet different sequences.
arXiv Detail & Related papers (2023-08-16T06:46:28Z)
- ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts [22.870765825298268]
We build a ProtST dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
arXiv Detail & Related papers (2023-01-28T00:58:48Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling (xSL) tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of input parallel sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene [10.822477939237459]
We propose contrastive masked language modeling (CMLM) for post-training to integrate both token-level and sequence-level contrastive learning.
CMLM surpasses several recent post-training methods in few-shot settings without the need for data augmentation.
arXiv Detail & Related papers (2021-06-04T08:17:48Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that the Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)
- Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models [11.483725773928382]
Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks.
We introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments.
Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases.
arXiv Detail & Related papers (2020-12-01T01:01:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.