Profile Prediction: An Alignment-Based Pre-Training Task for Protein
Sequence Models
- URL: http://arxiv.org/abs/2012.00195v1
- Date: Tue, 1 Dec 2020 01:01:34 GMT
- Title: Profile Prediction: An Alignment-Based Pre-Training Task for Protein
Sequence Models
- Authors: Pascal Sturmfels, Jesse Vig, Ali Madani, Nazneen Fatema Rajani
- Abstract summary: Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks.
We introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments.
Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases.
- Score: 11.483725773928382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For protein sequence datasets, unlabeled data has greatly outpaced labeled
data due to the high cost of wet-lab characterization. Recent deep-learning
approaches to protein prediction have shown that pre-training on unlabeled data
can yield useful representations for downstream tasks. However, the optimal
pre-training strategy remains an open question. Instead of strictly borrowing
from natural language processing (NLP) in the form of masked or autoregressive
language modeling, we introduce a new pre-training task: directly predicting
protein profiles derived from multiple sequence alignments. Using a set of
five, standardized downstream tasks for protein models, we demonstrate that our
pre-training task along with a multi-task objective outperforms masked language
modeling alone on all five tasks. Our results suggest that protein sequence
models may benefit from leveraging biologically-inspired inductive biases that
go beyond existing language modeling techniques in NLP.
Related papers
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z) - DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for
Automatic Protein Function Prediction [4.608328575930055]
Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem.
Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions.
We propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically.
arXiv Detail & Related papers (2023-07-24T07:01:32Z) - ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts [22.870765825298268]
We build a ProtST dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
arXiv Detail & Related papers (2023-01-28T00:58:48Z) - Reprogramming Pretrained Language Models for Protein Sequence
Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $105$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model [93.9943278892735]
Key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., Pairwise Masked Language Model (PMLM)
Our result shows that the proposed method can effectively capture the interresidue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z) - On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes the model much easier to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.