A Survey on Protein Representation Learning: Retrospect and Prospect
- URL: http://arxiv.org/abs/2301.00813v1
- Date: Sat, 31 Dec 2022 04:01:16 GMT
- Title: A Survey on Protein Representation Learning: Retrospect and Prospect
- Authors: Lirong Wu, Yufei Huang, Haitao Lin, Stan Z. Li
- Abstract summary: Protein representation learning is a promising research topic for extracting informative knowledge from massive protein sequences or structures.
We introduce the motivations for protein representation learning and formulate it in a general and unified framework.
Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proteins are fundamental biological entities that play a key role in life
activities. The amino acid sequences of proteins can be folded into stable 3D
structures in the real physicochemical world, forming a special kind of
sequence-structure data. With the development of Artificial Intelligence (AI)
techniques, Protein Representation Learning (PRL) has recently emerged as a
promising research topic for extracting informative knowledge from massive
protein sequences or structures. To pave the way for AI researchers with little
bioinformatics background, we present a timely and comprehensive review of PRL
formulations and existing PRL methods from the perspective of model
architectures, pretext tasks, and downstream applications. We first briefly
introduce the motivations for protein representation learning and formulate it
in a general and unified framework. Next, we divide existing PRL methods into
three main categories: sequence-based, structure-based, and sequence-structure
co-modeling. Finally, we discuss some technical challenges and potential
directions for improving protein representation learning. The latest advances
in PRL methods are summarized in a GitHub repository
https://github.com/LirongWu/awesome-protein-representation-learning.
Related papers
- Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?
We propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation.
Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains.
arXiv Detail & Related papers (2024-06-28T08:54:37Z)
- Geometric Self-Supervised Pretraining on 3D Protein Structures using Subgraphs
We propose a novel self-supervised method to pretrain 3D graph neural networks on 3D protein structures.
By considering subgraphs and their relationships to the global protein structure, the model can learn to reason about these hierarchical levels of organization.
arXiv Detail & Related papers (2024-06-20T09:34:31Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions.
Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge.
Our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM with distinct structure encoders.
arXiv Detail & Related papers (2023-03-11T01:24:10Z)
- Structure-informed Language Models Are Protein Designers
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We perform structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Deep Learning Methods for Protein Family Classification on PDB Sequencing Data
We demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data.
Our results show that our deep learning models deliver superior performance to classical machine learning methods, with the convolutional architecture providing the most impressive inference performance.
arXiv Detail & Related papers (2022-07-14T06:11:32Z)
- Structure-aware Protein Self-supervised Learning
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Transfer Learning for Protein Structure Classification at Low Resolution
We show that it is possible to make accurate (≥80%) predictions of protein class and architecture from structures determined at low (≤3Å) resolution.
We provide proof of concept for high-speed, low-cost protein structure classification at low resolution, and a basis for extension to prediction of function.
arXiv Detail & Related papers (2020-08-11T15:01:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.