Related papers: Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models

URL: http://arxiv.org/abs/2012.00195v1
Date: Tue, 1 Dec 2020 01:01:34 GMT
Title: Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models
Authors: Pascal Sturmfels, Jesse Vig, Ali Madani, Nazneen Fatema Rajani
Abstract summary: Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks. We introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments. Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases.
Score: 11.483725773928382
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: For protein sequence datasets, unlabeled data has greatly outpaced labeled data due to the high cost of wet-lab characterization. Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks. However, the optimal pre-training strategy remains an open question. Instead of strictly borrowing from natural language processing (NLP) in the form of masked or autoregressive language modeling, we introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments. Using a set of five standardized downstream tasks for protein models, we demonstrate that our pre-training task along with a multi-task objective outperforms masked language modeling alone on all five tasks. Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases that go beyond existing language modeling techniques in NLP.
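
Illustrative sketch (not the authors' code): the abstract describes predicting profiles derived from multiple sequence alignments. One simple way to picture such a target is a per-column amino acid frequency table computed from an alignment, compared to a model's predicted distribution with a KL divergence. The function names, the pseudocount smoothing, the toy alignment, and the choice of KL divergence are all assumptions made for illustration; the paper's actual profile construction and loss may differ.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def msa_profile(msa, pseudocount=1.0):
    """Per-column amino acid frequencies for a list of aligned sequences.

    Gap characters and any non-standard symbols are ignored. The pseudocount
    (an assumption of this sketch) keeps every probability strictly positive.
    """
    if not msa:
        raise ValueError("msa must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def kl_divergence(target, predicted):
    """KL(target || predicted) for two distributions over AMINO_ACIDS."""
    return sum(
        p * math.log(p / predicted[aa]) for aa, p in target.items() if p > 0
    )


# Toy alignment of four sequences; "-" marks a gap.
toy_msa = ["MKV", "MRV", "M-V", "MKI"]
toy_profile = msa_profile(toy_msa)

# A uniform "prediction" stands in for a model output at each position.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
losses = [kl_divergence(column, uniform) for column in toy_profile]
```

In a pre-training setup of this kind, a sequence model would emit one distribution per residue position and the per-position losses would be averaged. The uniform distribution above is only a placeholder for that output.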

Related papers

Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations [0.3124884279860061]
Protein language models (PLMs) have emerged as powerful tools to detect complex patterns of protein sequences. Our research investigated a multi-task pre-training strategy for PLMs. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences.
arXiv Detail & Related papers (2025-05-26T14:41:10Z)
SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions [8.112057136324431]
This study employs the LoRA method to perform end-to-end fine-tuning of the ESM-2 model. A multi-head attention mechanism is integrated into the downstream network to combine sequence features with contact map information.
arXiv Detail & Related papers (2024-11-18T12:40:39Z)
Metalic: Meta-Learning In-Context with Protein Language Models [5.868595531658237]
Machine learning has emerged as a promising technique for such prediction tasks. Due to data scarcity, we believe meta-learning will play a pivotal role in advancing protein engineering.
arXiv Detail & Related papers (2024-10-10T20:19:35Z)
How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities. We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for Automatic Protein Function Prediction [4.608328575930055]
Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem. Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions. We propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically.
arXiv Detail & Related papers (2023-07-24T07:01:32Z)
Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences. Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information. We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences. We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM). Our results show that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance. We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.