ProtFAD: Introducing function-aware domains as implicit modality towards protein function perception
- URL: http://arxiv.org/abs/2405.15158v1
- Date: Fri, 24 May 2024 02:26:45 GMT
- Title: ProtFAD: Introducing function-aware domains as implicit modality towards protein function perception
- Authors: Mingqing Wang, Zhiwei Nie, Yonghong He, Zhixiang Ren,
- Abstract summary: We propose a function-aware domain representation and a domain-joint contrastive learning strategy to distinguish different protein functions.
Our approach significantly and comprehensively outperforms the state-of-the-art methods on various benchmarks.
- Score: 0.3928425951824076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein function prediction is currently achieved by encoding its sequence or structure, where the sequence-to-function transcendence and high-quality structural data scarcity lead to obvious performance bottlenecks. Protein domains are "building blocks" of proteins that are functionally independent, and their combinations determine the diverse biological functions. However, most existing studies have yet to thoroughly explore the intricate functional information contained in the protein domains. To fill this gap, we propose a synergistic integration approach for a function-aware domain representation, and a domain-joint contrastive learning strategy to distinguish different protein functions while aligning the modalities. Specifically, we associate domains with the GO terms as function priors to pre-train domain embeddings. Furthermore, we partition proteins into multiple sub-views based on continuous joint domains for contrastive training under the supervision of a novel triplet InfoNCE loss. Our approach significantly and comprehensively outperforms the state-of-the-art methods on various benchmarks, and clearly differentiates proteins carrying distinct functions compared to the competitor.
Related papers
- OneProt: Towards Multi-Modal Protein Foundation Models [5.440531199006399]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data.
It surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction.
This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z) - SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z) - ProteinRPN: Towards Accurate Protein Function Prediction with Graph-Based Region Proposals [4.525216077859531]
We introduce the Protein Region Proposal Network (ProteinRPN) for accurate protein function prediction.
ProteinRPN identifies potential functional regions (anchors) which are refined through the hierarchy-aware node drop pooling layer.
The representations of the predicted functional nodes are enriched using attention mechanisms and fed into a Graph Multiset Transformer.
arXiv Detail & Related papers (2024-09-01T04:40:04Z) - A PLMs based protein retrieval framework [4.110243520064533]
We propose a novel protein retrieval framework that mitigates the bias towards sequence similarity.
Our framework initiatively harnesses protein language models (PLMs) to embed protein sequences within a high-dimensional feature space.
Extensive experiments demonstrate that our framework can equally retrieve both similar and dissimilar proteins.
arXiv Detail & Related papers (2024-07-16T09:52:42Z) - Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance? [4.7077642423577775]
We propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation.
Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains.
arXiv Detail & Related papers (2024-06-28T08:54:37Z) - Clustering for Protein Representation Learning [72.72957540484664]
We propose a neural clustering framework that can automatically discover the critical components of a protein.
Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids.
We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene term prediction, and enzyme commission number prediction.
arXiv Detail & Related papers (2024-03-30T05:51:09Z) - PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for
Efficient and Generalizable Compound-Protein Interaction Prediction [63.50967073653953]
Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods utilize only the single modality of protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
arXiv Detail & Related papers (2024-02-13T03:51:10Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - Multi-Scale Representation Learning on Proteins [78.31410227443102]
This paper introduces a multi-scale graph construction of a protein -- HoloProt.
The surface captures coarser details of the protein, while sequence as primary component and structure captures finer details.
Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level.
arXiv Detail & Related papers (2022-04-04T08:29:17Z) - PersGNN: Applying Topological Data Analysis and Geometric Deep Learning
to Structure-Based Protein Function Prediction [0.07340017786387766]
In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank.
We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis.
arXiv Detail & Related papers (2020-10-30T02:24:35Z) - BERTology Meets Biology: Interpreting Attention in Protein Language
Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.