DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for
Automatic Protein Function Prediction
- URL: http://arxiv.org/abs/2307.13004v1
- Date: Mon, 24 Jul 2023 07:01:32 GMT
- Title: DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for
Automatic Protein Function Prediction
- Authors: Zihao Li, Changkun Jiang, and Jianqiang Li
- Abstract summary: Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem.
Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions.
We propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically.
- Score: 4.608328575930055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic protein function prediction (AFP) is classified as a large-scale
multi-label classification problem aimed at automating protein enrichment
analysis to eliminate the current reliance on labor-intensive wet-lab methods.
Currently, popular methods primarily combine protein-related information and
Gene Ontology (GO) terms to generate final functional predictions. For example,
protein sequences, structural information, and protein-protein interaction
networks are integrated as prior knowledge to fuse with GO term embeddings and
generate the final prediction results. However, these methods are limited by
the difficulty of obtaining structural or network-topology information, as well
as by the uncertain accuracy of such data. Consequently, a growing number of
methods that use only protein sequences for protein function prediction have
been proposed, a more reliable and computationally cheaper approach. However,
these existing methods fail to fully extract feature information from protein
sequences or label data because they do not adequately account for the
intrinsic characteristics of the data. Therefore, we propose a
sequence-based hierarchical prediction method, DeepGATGO, which processes
protein sequences and GO term labels hierarchically, and utilizes graph
attention networks (GATs) and contrastive learning for protein function
prediction. Specifically, we compute embeddings of the sequence and label data
using pre-trained models to reduce computational costs and improve the
embedding accuracy. Then, we use GATs to dynamically extract the structural
information of non-Euclidean data, and learn general features of the label
dataset with contrastive learning by constructing positive and negative
samples. Experimental results demonstrate that our proposed model exhibits
better scalability in GO term enrichment analysis on large-scale datasets.
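To make the pipeline concrete, the sketch below pairs a single-head graph-attention layer (propagating GO-term label embeddings over the ontology graph) with an InfoNCE-style contrastive loss built from positive and negative pairs. It is a minimal reading of the abstract, not the authors' implementation; all names, shapes, and the exact loss form are assumptions.

```python
# Minimal sketch (not the authors' code) of the two ingredients the abstract
# describes: a graph-attention layer over the GO label graph and an
# InfoNCE-style contrastive loss over label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention (Velickovic et al., 2018)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim) GO-term embeddings; adj: (N, N) ontology adjacency,
        # assumed to include self-loops so every row has a neighbor.
        z = self.W(h)
        n = z.size(0)
        pair = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                          z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1))      # (N, N) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))      # restrict to GO edges
        alpha = torch.softmax(e, dim=-1)                # attention coefficients
        return F.elu(alpha @ z)                         # neighbor-aggregated labels

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.1):
    """Contrastive loss: row i of `positive` is the positive for row i of
    `anchor`; every other row in the batch serves as a negative."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = (a @ p.t()) / tau                          # (B, B) similarities
    target = torch.arange(a.size(0), device=a.device)   # positives on diagonal
    return F.cross_entropy(logits, target)
```

Masking the attention scores with the ontology adjacency is what lets the layer adapt to the non-Euclidean structure of the GO graph: each term aggregates features only from its neighbors, with learned weights.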
Related papers
- SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions [8.112057136324431]
This study employs the LoRA method to perform end-to-end fine-tuning of the ESM-2 model.
A multi-head attention mechanism is integrated into the downstream network to combine sequence features with contact map information.
arXiv Detail & Related papers (2024-11-18T12:40:39Z)
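SeqProFT's key ingredient, LoRA, can be summarized in a few lines: the pretrained weight stays frozen and a trainable low-rank update is added in parallel. The sketch below is the generic LoRA adapter (Hu et al., 2021), not the SeqProFT or ESM-2 code; the rank and scaling values are illustrative.

```python
# Generic LoRA adapter: the frozen weight W is augmented with a trainable
# low-rank product B @ A, so only r * (d_in + d_out) parameters are tuned.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Only A and B receive gradients, so fine-tuning touches r * (d_in + d_out) parameters per adapted layer rather than d_in * d_out.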
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- ProtIR: Iterative Refinement between Retrievers and Predictors for Protein Function Annotation [38.019425619750265]
We introduce a novel variational pseudo-likelihood framework, ProtIR, designed to improve function predictors by incorporating inter-protein similarity modeling.
ProtIR showcases around 10% improvement over vanilla predictor-based methods.
It achieves performance on par with protein language model-based methods, yet without the need for massive pre-training.
arXiv Detail & Related papers (2024-02-10T17:31:46Z)
- Structure-Informed Protein Language Model [38.019425619750265]
We introduce the integration of remote homology detection to distill structural information into protein language models.
We evaluate the impact of this structure-informed training on downstream protein function prediction tasks.
arXiv Detail & Related papers (2024-02-07T09:32:35Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
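As the R2DL summary above describes it, the reprogramming idea is to express each protein-token embedding as a learned combination of a frozen English-vocabulary embedding dictionary, so the pretrained language-model body is reused unchanged. A rough sketch under that reading; the dense-coefficient parameterization and all names are assumptions, and the paper's dictionary-learning procedure differs in detail.

```python
# Rough sketch of embedding reprogramming in the spirit of R2DL: protein
# tokens are embedded as learned combinations of a *frozen* English-token
# embedding matrix.
import torch
import torch.nn as nn

class ReprogrammedEmbedding(nn.Module):
    def __init__(self, english_emb: torch.Tensor, n_protein_tokens: int = 25):
        super().__init__()
        # (V_en, d) pretrained English token embeddings, kept frozen.
        self.register_buffer("dictionary", english_emb)
        # Trainable coefficients mapping each amino-acid token onto the dictionary.
        self.coeffs = nn.Parameter(
            torch.randn(n_protein_tokens, english_emb.size(0)) * 0.01)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        table = self.coeffs @ self.dictionary   # (n_protein_tokens, d) reprogrammed table
        return table[token_ids]                 # embed an amino-acid sequence
```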
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction by up to 9% over the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
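PMLM's pairwise-masking idea can be made concrete with a small prediction head: instead of predicting two masked residues independently, the model scores the joint pair over V x V classes, which is what exposes inter-residue co-variation. A hypothetical sketch, not the paper's architecture; shapes and names are assumptions.

```python
# Hypothetical pairwise masked-LM head: for two masked positions, predict the
# *joint* residue pair over V*V classes rather than two independent residues.
import torch
import torch.nn as nn
import torch.nn.functional as F

V = 25  # amino-acid vocabulary size (illustrative)

class PairwiseMLMHead(nn.Module):
    def __init__(self, hidden: int, vocab: int = V):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, vocab * vocab)

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        # h_i, h_j: (B, hidden) encoder states at the two masked positions.
        return self.proj(torch.cat([h_i, h_j], dim=-1))  # (B, V*V) joint logits

def pair_loss(logits: torch.Tensor, a_i: torch.Tensor, a_j: torch.Tensor):
    # Cross-entropy on the flattened pair index: (a_i, a_j) -> a_i * V + a_j.
    return F.cross_entropy(logits, a_i * V + a_j)
```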
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
- Bayesian neural network with pretrained protein embedding enhances prediction accuracy of drug-protein interaction [3.499870393443268]
Deep learning approaches can predict drug-protein interactions without human trial-and-error.
We propose two methods to construct a deep learning framework that exhibits superior performance with a small labeled dataset.
arXiv Detail & Related papers (2020-12-15T10:24:34Z)
- Deep Learning of High-Order Interactions for Protein Interface Prediction [58.164371994210406]
We propose to formulate protein interface prediction as a 2D dense prediction problem.
We represent proteins as graphs and employ graph neural networks to learn node features.
We incorporate high-order pairwise interactions to generate a 3D tensor containing different pairwise interactions.
arXiv Detail & Related papers (2020-07-18T05:39:35Z)
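The "3D tensor containing different pairwise interactions" in the last entry can be illustrated directly: per-residue node features from a GNN are broadcast over all (i, j) residue pairs of the two proteins, producing an (Na, Nb, channels) tensor that a 2D dense-prediction head (e.g., a CNN) can score. The concatenation scheme below is an illustrative assumption, not the paper's exact feature set.

```python
# Illustrative construction of a pairwise interaction tensor from per-residue
# GNN node features of two proteins.
import torch

def pairwise_tensor(h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
    # h_a: (Na, d) node features of protein A; h_b: (Nb, d) of protein B.
    na, nb = h_a.size(0), h_b.size(0)
    a = h_a.unsqueeze(1).expand(na, nb, -1)   # repeat A's residues across columns
    b = h_b.unsqueeze(0).expand(na, nb, -1)   # repeat B's residues across rows
    return torch.cat([a, b, a * b], dim=-1)   # (Na, Nb, 3d) pairwise features
```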