A Survey on Protein Representation Learning: Retrospect and Prospect
- URL: http://arxiv.org/abs/2301.00813v1
- Date: Sat, 31 Dec 2022 04:01:16 GMT
- Authors: Lirong Wu, Yufei Huang, Haitao Lin, Stan Z. Li
- Abstract summary: Protein representation learning is a promising research topic for extracting informative knowledge from massive protein sequences or structures.
We introduce the motivations for protein representation learning and formulate it in a general and unified framework.
Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proteins are fundamental biological entities that play a key role in life
activities. The amino acid sequences of proteins can be folded into stable 3D
structures in the real physicochemical world, forming a special kind of
sequence-structure data. With the development of Artificial Intelligence (AI)
techniques, Protein Representation Learning (PRL) has recently emerged as a
promising research topic for extracting informative knowledge from massive
protein sequences or structures. To pave the way for AI researchers with little
bioinformatics background, we present a timely and comprehensive review of PRL
formulations and existing PRL methods from the perspective of model
architectures, pretext tasks, and downstream applications. We first briefly
introduce the motivations for protein representation learning and formulate it
in a general and unified framework. Next, we divide existing PRL methods into
three main categories: sequence-based, structure-based, and sequence-structure
co-modeling. Finally, we discuss some technical challenges and potential
directions for improving protein representation learning. The latest advances
in PRL methods are summarized in a GitHub repository
https://github.com/LirongWu/awesome-protein-representation-learning.
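As a concrete illustration of the sequence-based category (not taken from the survey itself), many PRL methods pretrain with a masked-residue prediction pretext task: a fraction of residues is hidden and the model learns to recover them from context. The sketch below shows only the corruption step; the token, function, and variable names are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # illustrative mask token

def mask_sequence(seq, rate=0.15, rng=None):
    """Replace ~`rate` of residues with a mask token; return the corrupted
    sequence and the positions (with ground-truth residues) the model
    would be trained to predict."""
    rng = rng or random.Random(0)
    chars = list(seq)
    targets = {}
    for i, aa in enumerate(chars):
        if rng.random() < rate:
            targets[i] = aa        # ground-truth residue to recover
            chars[i] = MASK        # corrupted input the encoder sees
    return "".join(chars), targets

corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(corrupted)
print(targets)
```

A real pipeline would feed `corrupted` through an encoder (e.g. a Transformer) and compute a cross-entropy loss over the residues stored in `targets`.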
Related papers
- Computational Protein Science in the Era of Large Language Models (LLMs)
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- A Survey of Deep Learning Methods in Protein Bioinformatics and its Impact on Protein Design
Deep learning has demonstrated remarkable performance in fields such as computer vision and natural language processing.
It has been increasingly applied in recent years to the data-rich domain of protein sequences with great success.
The performance improvements achieved by deep learning unlock new possibilities in the field of protein bioinformatics.
arXiv Detail & Related papers (2025-01-02T05:21:34Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Recent advances in interpretable machine learning using structure-based protein representations
Recent advancements in machine learning (ML) are transforming the field of structural biology.
We present various methods for representing protein 3D structures, from low to high resolution.
We show how interpretable ML methods can support tasks such as predicting protein structures, protein function, and protein-protein interactions.
arXiv Detail & Related papers (2024-09-26T10:56:27Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions.
Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge.
Our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM with distinct structure encoders.
arXiv Detail & Related papers (2023-03-11T01:24:10Z)
- Structure-informed Language Models Are Protein Designers
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We perform structural surgery on pLMs: a lightweight structural adapter is implanted into the pLM, endowing it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Deep Learning Methods for Protein Family Classification on PDB Sequencing Data
We demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data.
Our results show that our deep learning models outperform classical machine learning methods, with the convolutional architecture delivering the best inference performance.
arXiv Detail & Related papers (2022-07-14T06:11:32Z)
- Structure-aware Protein Self-supervised Learning
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
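The structure-based entries above rely on converting a protein's 3D coordinates into a residue graph before applying a GNN. A minimal sketch of one common construction (a k-nearest-neighbor graph over Cα atoms) is given below; it is an illustration under stated assumptions, not any paper's actual model, and all names are invented for this example.

```python
import numpy as np

def knn_residue_graph(ca_coords, k=3):
    """Build an edge list (i, j) linking each residue i to its k nearest
    residues j by Cα-Cα Euclidean distance."""
    coords = np.asarray(ca_coords, dtype=float)
    n = len(coords)
    # Pairwise distance matrix via broadcasting: (n, n, 3) -> (n, n)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)     # exclude self-loops
    edges = []
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            edges.append((i, int(j)))
    return edges
```

The resulting edge list (optionally decorated with distance or angle features) would then serve as the input graph for a message-passing GNN encoder.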
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.