A Survey on Protein Representation Learning: Retrospect and Prospect
- URL: http://arxiv.org/abs/2301.00813v1
- Date: Sat, 31 Dec 2022 04:01:16 GMT
- Title: A Survey on Protein Representation Learning: Retrospect and Prospect
- Authors: Lirong Wu, Yufei Huang, Haitao Lin, Stan Z. Li
- Abstract summary: Protein representation learning is a promising research topic for extracting informative knowledge from massive protein sequences or structures.
We introduce the motivations for protein representation learning and formulate it in a general and unified framework.
Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling.
- Score: 42.38007308086495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proteins are fundamental biological entities that play a key role in life
activities. The amino acid sequences of proteins can be folded into stable 3D
structures in the real physicochemical world, forming a special kind of
sequence-structure data. With the development of Artificial Intelligence (AI)
techniques, Protein Representation Learning (PRL) has recently emerged as a
promising research topic for extracting informative knowledge from massive
protein sequences or structures. To pave the way for AI researchers with little
bioinformatics background, we present a timely and comprehensive review of PRL
formulations and existing PRL methods from the perspective of model
architectures, pretext tasks, and downstream applications. We first briefly
introduce the motivations for protein representation learning and formulate it
in a general and unified framework. Next, we divide existing PRL methods into
three main categories: sequence-based, structure-based, and sequence-structure
co-modeling. Finally, we discuss some technical challenges and potential
directions for improving protein representation learning. The latest advances
in PRL methods are summarized in a GitHub repository
https://github.com/LirongWu/awesome-protein-representation-learning.
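As a hedged illustration of the sequence-based category described in the abstract (not a method from the survey itself), the sketch below shows the simplest possible numeric representation of a protein sequence: a one-hot encoding over the 20 standard amino acids. Real PRL methods replace this with learned embeddings or language-model features; all names here are illustrative.

```python
# Minimal sketch: one-hot encoding of an amino acid sequence, the most
# basic sequence-based protein representation. Learned embeddings from a
# protein language model would replace this in practice.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> list[list[int]]:
    """Map each residue to a 20-dimensional one-hot vector."""
    vectors = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        vec[AA_INDEX[aa]] = 1
        vectors.append(vec)
    return vectors

encoded = one_hot_encode("MKV")  # Met, Lys, Val
print(len(encoded), len(encoded[0]))  # 3 residues x 20 dimensions
```

Downstream models (CNNs, LSTMs, Transformers) consume such per-residue vectors; the survey's taxonomy differs mainly in how much structural information is mixed into this representation.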
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding [43.811432723460534]
We introduce Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap.
Our approach integrates a novel structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate protein understanding.
We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z)
- Recent advances in interpretable machine learning using structure-based protein representations [30.907048279915312]
Recent advancements in machine learning (ML) are transforming the field of structural biology.
We present various methods for representing protein 3D structures from low to high-resolution.
We show how interpretable ML methods can support tasks such as predicting protein structure, protein function, and protein-protein interactions.
arXiv Detail & Related papers (2024-09-26T10:56:27Z)
- Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance? [4.7077642423577775]
We propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation.
Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains.
arXiv Detail & Related papers (2024-06-28T08:54:37Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- A Systematic Study of Joint Representation Learning on Protein Sequences and Structures [38.94729758958265]
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions.
Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge.
Our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM with distinct structure encoders.
arXiv Detail & Related papers (2023-03-11T01:24:10Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, in which a lightweight structural adapter is implanted into the pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Deep Learning Methods for Protein Family Classification on PDB Sequencing Data [0.0]
We demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data.
Our results show that our deep learning models deliver superior performance to classical machine learning methods, with the convolutional architecture providing the most impressive inference performance.
arXiv Detail & Related papers (2022-07-14T06:11:32Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Transfer Learning for Protein Structure Classification at Low Resolution [124.5573289131546]
We show that it is possible to make accurate (≥80%) predictions of protein class and architecture from structures determined at low (≤3 Å) resolution.
We provide proof of concept for high-speed, low-cost protein structure classification at low resolution, and a basis for extension to prediction of function.
arXiv Detail & Related papers (2020-08-11T15:01:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.