Retrieved Sequence Augmentation for Protein Representation Learning
- URL: http://arxiv.org/abs/2302.12563v1
- Date: Fri, 24 Feb 2023 10:31:45 GMT
- Title: Retrieved Sequence Augmentation for Protein Representation Learning
- Authors: Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu,
Zhihong Deng, Yang Lu, Qi Liu, Lingpeng Kong
- Abstract summary: We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
- Score: 40.13920287967866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein language models have excelled in a variety of tasks, ranging from
structure prediction to protein engineering. However, proteins are highly
diverse in functions and structures, and current state-of-the-art models
including the latest version of AlphaFold rely on Multiple Sequence Alignments
(MSA) to feed in the evolutionary knowledge. Despite their success, heavy
computational overheads, as well as de novo and orphan proteins, remain
great challenges in protein representation learning. In this work, we show that
MSA-augmented models inherently belong to retrieval-augmented methods. Motivated
by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein
representation learning without additional alignment or pre-processing. RSA
links query protein sequences to a set of sequences with similar structures or
properties in the database and combines these sequences for downstream
prediction. We show that protein language models benefit from the retrieval
enhancement on both structure prediction and property prediction tasks, with a
5% improvement on MSA Transformer on average while being 373 times faster. In
addition, we show that our model can transfer to new protein domains better and
outperforms MSA Transformer on de novo protein prediction. Our study fills a
much-encountered gap in protein prediction and brings us a step closer to
demystifying the domain knowledge needed to understand protein sequences. Code
is available on https://github.com/HKUNLP/RSA.
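The retrieve-then-combine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the k-mer count embedding stands in for a learned protein encoder, the softmax-weighted average is only one plausible way to combine retrieved sequences, and every name here (kmer_embedding, retrieve, augmented_predict, toy_predict_pair) is hypothetical.

```python
import math
from collections import Counter


def kmer_embedding(seq, k=2):
    """Toy stand-in for a learned encoder: counts of overlapping k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, database, top_k=2):
    """Return the top_k (similarity, sequence) pairs; no alignment is computed."""
    q = kmer_embedding(query)
    scored = [(cosine(q, kmer_embedding(s)), s) for s in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def softmax(xs):
    """Turn similarity scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def augmented_predict(query, database, predict_pair, top_k=2):
    """Assumed aggregation: similarity-weighted average of per-pair predictions."""
    hits = retrieve(query, database, top_k)
    weights = softmax([score for score, _ in hits])
    return sum(w * predict_pair(query, s) for w, (_, s) in zip(weights, hits))


def toy_predict_pair(query, retrieved):
    """Hypothetical downstream head: hydrophobic fraction of the joined pair."""
    joint = query + retrieved
    return sum(ch in "AILMFVW" for ch in joint) / len(joint)


# Made-up sequences, for illustration only.
database = ["MKTAYIAKQR", "MKTAYLAKQR", "GGGGSGGGGS", "MKVAYIAKQL"]
query = "MKTAYIAKQK"
hits = retrieve(query, database, top_k=2)
prediction = augmented_predict(query, database, toy_predict_pair, top_k=2)
```

In a realistic setting the embedding would come from a pretrained protein language model and the search from an approximate nearest-neighbour index; the sketch only shows the control flow of retrieving first and then aggregating predictions conditioned on each retrieved sequence.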
Related papers
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bonds, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- Structure-Informed Protein Language Model [38.019425619750265]
We introduce the integration of remote homology detection to distill structural information into protein language models.
We evaluate the impact of this structure-informed training on downstream protein function prediction tasks.
arXiv Detail & Related papers (2024-02-07T09:32:35Z)
- Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- PoET: A generative model of protein families as sequences-of-sequences [5.05828899601167]
We propose a generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences.
PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest.
We show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all depths.
arXiv Detail & Related papers (2023-06-09T16:06:36Z)
- ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models [0.0]
In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while maintaining other residues.
Protein language models (pLMs) have been a promising tool for protein sequence design.
We show that language models trained via fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering.
arXiv Detail & Related papers (2023-03-29T04:35:50Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.