Deep Manifold Transformation for Protein Representation Learning
- URL: http://arxiv.org/abs/2402.09416v1
- Date: Fri, 12 Jan 2024 18:38:14 GMT
- Title: Deep Manifold Transformation for Protein Representation Learning
- Authors: Bozhen Hu, Zelin Zang, Cheng Tan, Stan Z. Li
- Abstract summary: We propose a new deep manifold transformation approach for universal protein representation learning (DMTPRL).
It employs manifold learning strategies to improve the quality and adaptability of the learned embeddings.
Our proposed DMTPRL method outperforms state-of-the-art baselines on diverse downstream tasks across popular datasets.
- Score: 42.43017670985785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein representation learning is critical in various tasks in biology, such
as drug design and protein structure or function prediction, which has
primarily benefited from protein language models and graph neural networks.
These models can capture intrinsic patterns from protein sequences and
structures through masking and task-related losses. However, the learned
protein representations are usually not well optimized, leading to performance
degradation due to limited data, difficulty adapting to new tasks, etc. To
address this, we propose a new \underline{d}eep \underline{m}anifold
\underline{t}ransformation approach for universal \underline{p}rotein
\underline{r}epresentation \underline{l}earning (DMTPRL). It employs manifold
learning strategies to improve the quality and adaptability of the learned
embeddings. Specifically, we apply a novel manifold learning loss during
training based on the graph inter-node similarity. Our proposed DMTPRL method
outperforms state-of-the-art baselines on diverse downstream tasks across
popular datasets. This validates our approach for learning universal and robust
protein representations. We promise to release the code after acceptance.
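The abstract mentions a manifold learning loss based on graph inter-node similarity but gives no formula, so the PyTorch sketch below only illustrates one plausible form of such a loss; the function name, the cosine/softmax choices, and the KL objective are assumptions, not the authors' released implementation. The idea is to push pairwise similarities between learned node embeddings toward similarities derived from the input protein graph.

```python
import torch
import torch.nn.functional as F

def manifold_similarity_loss(embeddings: torch.Tensor,
                             graph_similarity: torch.Tensor,
                             temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical manifold-learning loss: align the row-wise similarity
    distribution of learned node embeddings with a similarity matrix taken
    from the input protein graph (e.g., normalized adjacency or
    distance-derived weights). A sketch, not the paper's exact loss."""
    # Pairwise cosine similarities between node embeddings, scaled by temperature.
    z = F.normalize(embeddings, dim=-1)
    emb_sim = (z @ z.T) / temperature

    # Convert both similarity matrices into row-wise probability distributions.
    p_graph = F.softmax(graph_similarity, dim=-1)   # target structure from the graph
    log_p_emb = F.log_softmax(emb_sim, dim=-1)      # learned embedding similarities

    # KL divergence pulls embedding-space similarities toward the graph's.
    return F.kl_div(log_p_emb, p_graph, reduction="batchmean")

# Toy usage: 64 nodes with 128-dim embeddings and raw graph similarity scores.
node_emb = torch.randn(64, 128)
graph_sim = torch.randn(64, 64)
loss = manifold_similarity_loss(node_emb, graph_sim)
```

In practice a term like this would be added, with some weight, to the task-specific loss during training.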
Related papers
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
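Such augmentation can be as simple as concatenating extra per-residue descriptors onto a GNN's node features; the sketch below is a minimal illustration under that assumption (feature names and dimensions are hypothetical, not the NaNa/MiGu implementation).

```python
import torch

def augment_node_features(residue_onehot: torch.Tensor,
                          biophysical: torch.Tensor,
                          secondary_structure: torch.Tensor) -> torch.Tensor:
    """Concatenate hypothetical per-residue descriptors (e.g., hydrophobicity,
    charge, secondary-structure one-hots) onto the base residue encoding.
    The resulting matrix is used as the GNN's node-feature input."""
    return torch.cat([residue_onehot, biophysical, secondary_structure], dim=-1)

# Toy example: 120 residues, 20-dim one-hot + 5 biophysical + 3 secondary-structure dims.
x = augment_node_features(torch.zeros(120, 20),
                          torch.rand(120, 5),
                          torch.zeros(120, 3))
print(x.shape)  # torch.Size([120, 28])
```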
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- A Systematic Study of Joint Representation Learning on Protein Sequences and Structures [38.94729758958265]
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions.
Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge.
Our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM with distinct structure encoders.
arXiv Detail & Related papers (2023-03-11T01:24:10Z)
- Boosting Convolutional Neural Networks' Protein Binding Site Prediction Capacity Using SE(3)-invariant transformers, Transfer Learning and Homology-based Augmentation [1.160208922584163]
Identifying small binding sites in target proteins, at the resolution of either pocket or residue, is critical in real drug discovery scenarios.
Here we present a new computational method for binding site prediction that is relevant to real-world applications.
arXiv Detail & Related papers (2023-02-20T05:02:40Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Reprogramming via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model attains better accuracy and significantly improves the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
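The reprogramming idea can be sketched as learning a mapping from the protein vocabulary into the frozen English model's token-embedding space; the module below is a hypothetical illustration of that step (class name, softmax-weighted dictionary, and dimensions are assumptions, not the actual R2DL code).

```python
import torch
import torch.nn as nn

class ReprogrammedEmbedding(nn.Module):
    """Hypothetical sketch: express each protein token's embedding as a learned
    linear combination (a 'dictionary' of coefficients) over the frozen token
    embeddings of a pretrained English language model."""
    def __init__(self, protein_vocab: int, frozen_lm_embeddings: torch.Tensor):
        super().__init__()
        self.register_buffer("lm_emb", frozen_lm_embeddings)     # (V_en, d), kept frozen
        self.coeff = nn.Parameter(torch.randn(protein_vocab,
                                              frozen_lm_embeddings.size(0)) * 0.01)

    def forward(self, protein_tokens: torch.Tensor) -> torch.Tensor:
        # Map amino-acid token ids to embeddings the frozen LM can consume.
        dictionary = self.coeff.softmax(dim=-1) @ self.lm_emb    # (V_protein, d)
        return dictionary[protein_tokens]

# Toy usage: 25 protein tokens reprogrammed into a 768-dim frozen embedding space.
emb = ReprogrammedEmbedding(25, torch.randn(30522, 768))
out = emb(torch.randint(0, 25, (2, 128)))   # (batch, seq_len, 768)
```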
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Multi-Scale Representation Learning on Proteins [78.31410227443102]
This paper introduces a multi-scale graph construction of a protein -- HoloProt.
The surface captures coarser details of the protein, while the sequence (as primary component) and structure capture finer details.
Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level.
arXiv Detail & Related papers (2022-04-04T08:29:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.