Multi-modal Protein Knowledge Graph Construction and Applications
- URL: http://arxiv.org/abs/2207.10080v3
- Date: Mon, 14 Nov 2022 16:26:52 GMT
- Title: Multi-modal Protein Knowledge Graph Construction and Applications
- Authors: Siyuan Cheng, Xiaozhuan Liang, Zhen Bi, Huajun Chen, Ningyu Zhang
- Abstract summary: We create ProteinKG65, a knowledge graph for protein science.
Using gene ontology and Uniprot knowledge base as a basis, we transform various kinds of knowledge with aligned descriptions and protein sequences.
ProteinKG65 is mainly dedicated to providing a specialized protein knowledge graph, bringing the knowledge of Gene Ontology to protein function and structure prediction.
- Score: 30.500520131560112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing data-centric methods for protein science generally cannot
sufficiently capture and leverage biology knowledge, which may be crucial for
many protein tasks. To facilitate research in this field, we create
ProteinKG65, a knowledge graph for protein science. Using gene ontology and
Uniprot knowledge base as a basis, we transform and integrate various kinds of
knowledge with aligned descriptions and protein sequences, respectively, to GO
terms and protein entities. ProteinKG65 is mainly dedicated to providing a
specialized protein knowledge graph, bringing the knowledge of Gene Ontology to
protein function and structure prediction. We also illustrate the potential
applications of ProteinKG65 with a prototype. Our dataset can be downloaded at
https://w3id.org/proteinkg65.
Related papers
- Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX [14.927425008686692]
We introduce HelixProtX, a system built upon the large multimodal model, to support any-to-any protein modality generation.
HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models.
arXiv Detail & Related papers (2024-07-12T14:03:02Z) - ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z) - ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z) - NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bonds, andionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z) - ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z) - A Text-guided Protein Design Framework [109.18157766856196]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation.
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 10 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z) - OntoProtein: Protein Pretraining With Gene Ontology Embedding [36.92674447484136]
We propose OntoProtein, the first general framework that makes use of structure in GO (Gene Ontology) into protein pre-training models.
We construct a novel large-scale knowledge graph that consists of GO and its related proteins, and gene annotation texts or protein sequences describe all nodes in the graph.
arXiv Detail & Related papers (2022-01-23T14:49:49Z) - Deep Generative Modeling for Protein Design [0.0]
Deep learning approaches have produced breakthroughs in fields such as image classification and natural language processing.
generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins.
We discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model guided protein design.
arXiv Detail & Related papers (2021-08-31T14:38:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.