A Systematic Study of Joint Representation Learning on Protein Sequences
and Structures
- URL: http://arxiv.org/abs/2303.06275v2
- Date: Wed, 18 Oct 2023 16:11:11 GMT
- Title: A Systematic Study of Joint Representation Learning on Protein Sequences
and Structures
- Authors: Zuobai Zhang, Chuanrui Wang, Minghao Xu, Vijil Chenthamarakshan,
Aurélie Lozano, Payel Das, Jian Tang
- Abstract summary: Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions.
Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge.
Our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM with distinct structure encoders.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning effective protein representations is critical in a variety of tasks
in biology such as predicting protein functions. Recent sequence representation
learning methods based on Protein Language Models (PLMs) excel in
sequence-based tasks, but their direct adaptation to tasks involving protein
structures remains a challenge. In contrast, structure-based methods leverage
3D structural information with graph neural networks, and geometric pre-training
methods show potential in function prediction tasks but still suffer from the
limited number of available structures. To bridge this gap, our study
undertakes a comprehensive exploration of joint protein representation learning
by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders
(GVP, GearNet, CDConv). We introduce three representation fusion strategies and
explore different pre-training techniques. Our method achieves significant
improvements over existing sequence- and structure-based methods, setting new
state-of-the-art for function annotation. This study underscores several
important design choices for fusing protein sequence and structure information.
Our implementation is available at
https://github.com/DeepGraphLearning/ESM-GearNet.
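As a rough, hedged sketch of what one such fusion strategy can look like, the snippet below implements "serial" fusion in PyTorch: per-residue embeddings from a PLM such as ESM-2 are fed as node features into a structure encoder, which refines them by message passing over a residue contact graph. The class name, the stand-in linear message-passing layers, and all dimensions are illustrative assumptions, not the actual ESM-GearNet code (see the repository above for that).

```python
# Minimal sketch of "serial" sequence-structure fusion. All module and
# parameter names here are illustrative, not the ESM-GearNet API.
import torch
import torch.nn as nn

class SerialFusion(nn.Module):
    """Feed per-residue PLM embeddings into a structure encoder as node
    features, so the structure model refines sequence features."""

    def __init__(self, plm_dim: int = 1280, hidden_dim: int = 512):
        super().__init__()
        # Stand-ins for the PLM projection and a structure encoder
        # (GVP / GearNet / CDConv in the paper).
        self.project = nn.Linear(plm_dim, hidden_dim)
        self.structure_layers = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(3)
        )

    def forward(self, residue_emb: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # residue_emb: (num_residues, plm_dim) per-residue PLM features
        # adj: (num_residues, num_residues) binary contact/adjacency matrix
        h = self.project(residue_emb)
        for layer in self.structure_layers:
            # Toy message passing: aggregate neighbor features, residual update.
            h = torch.relu(layer(adj @ h) + h)
        return h.mean(dim=0)  # pooled protein-level representation

# Toy usage with random stand-ins for real embeddings and contacts:
model = SerialFusion()
emb = torch.randn(100, 1280)                 # e.g. ESM-2 650M residue embeddings
adj = (torch.rand(100, 100) < 0.05).float()  # sparse contact graph
print(model(emb, adj).shape)                 # torch.Size([512])
```

In this serial layout the structure encoder sees sequence features directly; the alternative fusion strategies combine the outputs of both encoders instead, at the cost of keeping two full representations around.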
Related papers
- Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?
We propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation.
Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains.
arXiv Detail & Related papers (2024-06-28T08:54:37Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Neural Embeddings for Protein Graphs
We propose a novel framework for embedding protein graphs in geometric vector spaces.
We learn an encoder function that preserves the structural distance between protein graphs.
Our framework achieves remarkable results in the task of protein structure classification.
arXiv Detail & Related papers (2023-06-07T14:50:34Z)
- Integration of Pre-trained Protein Language Models into Geometric Deep Learning Networks
We integrate knowledge learned by protein language models into several state-of-the-art geometric networks.
Our findings show an overall improvement of 20% over baselines.
Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin.
arXiv Detail & Related papers (2022-12-07T04:04:04Z)
- Contrastive Representation Learning for 3D Protein Structures
We introduce a new representation learning framework for 3D protein structures.
Our framework uses unsupervised contrastive learning to learn meaningful representations of protein structures.
We show how these representations can be used to solve a wide variety of tasks, such as protein function prediction, protein fold classification, structural similarity prediction, and protein-ligand binding affinity prediction.
arXiv Detail & Related papers (2022-05-31T10:33:06Z)
- Structure-aware Protein Self-supervised Learning
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein sequence-to-structure learning: Is this the end(-to-end revolution)?
In CASP14, deep learning boosted the field to unanticipated levels, reaching near-experimental accuracy.
Novel emerging approaches include (i) geometric learning, i.e. learning on representations such as graphs, 3D Voronoi tessellations, and point clouds.
We provide an overview and our opinion of the novel deep learning approaches developed in the last two years and widely used in CASP14.
arXiv Detail & Related papers (2021-05-16T10:46:44Z)
- PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction
In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank.
We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis.
arXiv Detail & Related papers (2020-10-30T02:24:35Z)
- Transfer Learning for Protein Structure Classification at Low Resolution
We show that it is possible to make accurate (≥80%) predictions of protein class and architecture from structures determined at low (≤3 Å) resolution.
We provide proof of concept for high-speed, low-cost protein structure classification at low resolution, and a basis for extension to prediction of function.
arXiv Detail & Related papers (2020-08-11T15:01:32Z)
- BERTology Meets Biology: Interpreting Attention in Protein Language Models
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence but spatially close in the three-dimensional structure (a toy version of this analysis is sketched below).
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
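To make the attention-structure observation in the last entry concrete, here is a toy sketch of one such analysis: measuring how often a head's strongest long-range attention weights fall on residue pairs that are contacts in the 3D structure. The function name, the sequence-separation cutoff, and the top-k threshold are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of an attention-vs-contact analysis; inputs `attn` and
# `contacts` are assumed to come from a protein Transformer and a known
# structure, respectively, not from any specific library call.
import torch

def attention_contact_precision(attn: torch.Tensor,
                                contacts: torch.Tensor,
                                top_k: int = 50) -> float:
    """attn: (L, L) attention map for one head; contacts: (L, L) binary
    contact map (1 if two residues are spatially close, e.g. within 8 Å)."""
    # Ignore trivial short-range pairs near the diagonal (|i - j| < 6).
    mask = torch.triu(torch.ones_like(contacts), diagonal=6)
    scores = (attn * mask).flatten()
    # Fraction of the head's top-k long-range attention pairs that are contacts.
    idx = scores.topk(top_k).indices
    return contacts.flatten()[idx].float().mean().item()

# Toy usage with random tensors standing in for a real head and structure:
L = 120
attn = torch.rand(L, L)
contacts = (torch.rand(L, L) < 0.1).float()
print(attention_contact_precision(attn, contacts))
```

A head whose precision sits far above the background contact density is attending to spatial structure rather than to sequence neighborhood.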