Reprogramming Pretrained Language Models for Protein Sequence
Representation Learning
- URL: http://arxiv.org/abs/2301.02120v1
- Date: Thu, 5 Jan 2023 15:55:18 GMT
- Title: Reprogramming Pretrained Language Models for Protein Sequence
Representation Learning
- Authors: Ria Vinod, Pin-Yu Chen, and Payel Das
- Abstract summary: We propose Representation Reprogramming via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
- Score: 68.75392232599654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning-guided solutions for protein learning tasks have made
significant headway in recent years. However, success in scientific discovery
tasks is limited by the accessibility of well-defined and labeled in-domain
data. To tackle the low-data constraint, recent adaptations of deep learning
models pretrained on millions of protein sequences have shown promise; however,
constructing such domain-specific, large-scale models is computationally
expensive. Here, we propose Representation Reprogramming via Dictionary Learning
(R2DL), an end-to-end representation learning framework in which we reprogram
deep models for alternate-domain tasks that can perform well on protein
property prediction with significantly fewer training samples. R2DL reprograms
a pretrained English language model to learn the embeddings of protein
sequences, by learning a sparse linear mapping between English and protein
sequence vocabulary embeddings. Our model can attain better accuracy and
significantly improve the data efficiency by up to $10^5$ times over the
baselines set by pretrained and standard supervised methods. To this end, we
reprogram an off-the-shelf pre-trained English language transformer and
benchmark it on a set of protein physicochemical prediction tasks (secondary
structure, stability, homology) as well as on a biomedically
relevant set of protein function prediction tasks (antimicrobial, toxicity,
antibody affinity).
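The core mechanism described above, reprogramming a frozen English language model by expressing each protein token embedding as a sparse linear combination of the English vocabulary embeddings, can be illustrated with a minimal PyTorch-style sketch. The class name, tensor shapes, and the L1 sparsity surrogate below are illustrative assumptions; the paper learns the mapping with dictionary learning, so this is a sketch of the idea rather than the authors' exact R2DL procedure.

```python
import torch
import torch.nn as nn


class TokenReprogrammer(nn.Module):
    """Maps a protein vocabulary into the embedding space of a frozen English model.

    Illustrative sketch: shapes and the L1 penalty are assumptions, not the
    paper's exact dictionary-learning formulation.
    """

    def __init__(self, english_embeddings: torch.Tensor, protein_vocab_size: int):
        super().__init__()
        # Frozen source-domain (English) vocabulary embeddings, shape (V_en, d).
        self.register_buffer("source_embeddings", english_embeddings)
        # Learnable mapping Theta, shape (V_protein, V_en); sparsity is only
        # encouraged here via the L1 penalty below, not enforced exactly.
        self.theta = nn.Parameter(
            0.01 * torch.randn(protein_vocab_size, english_embeddings.shape[0])
        )

    def forward(self, protein_token_ids: torch.Tensor) -> torch.Tensor:
        # Each protein token embedding is a linear combination of English
        # vocabulary embeddings: E_protein = Theta @ E_english, shape (V_protein, d).
        protein_embeddings = self.theta @ self.source_embeddings
        # Look up the embeddings for the given protein token ids; the result
        # would be fed to the frozen pretrained transformer.
        return protein_embeddings[protein_token_ids]

    def sparsity_penalty(self) -> torch.Tensor:
        # L1 surrogate for the sparse mapping, added to the downstream task loss.
        return self.theta.abs().mean()
```

In such a setup, only the mapping (and a small task head) would be trained on the labeled protein data, e.g. with a loss of the form task_loss + lambda * reprogrammer.sparsity_penalty(), while the pretrained English transformer stays frozen.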
Related papers
- Metalic: Meta-Learning In-Context with Protein Language Models [5.868595531658237]
Machine learning has emerged as a promising technique for protein property prediction tasks.
Due to data scarcity, we believe meta-learning will play a pivotal role in advancing protein engineering.
arXiv Detail & Related papers (2024-10-10T20:19:35Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)
- Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models [11.483725773928382]
Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks.
We introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments.
Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases.
arXiv Detail & Related papers (2020-12-01T01:01:34Z)
- Is Transfer Learning Necessary for Protein Landscape Prediction? [14.098875826640883]
We show that CNN models trained solely using supervised learning both compete with and sometimes outperform the best models from TAPE.
The benchmarking tasks proposed by TAPE are excellent measures of a model's ability to predict protein function and should be used going forward.
arXiv Detail & Related papers (2020-10-31T20:41:36Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.