OntoProtein: Protein Pretraining With Gene Ontology Embedding
- URL: http://arxiv.org/abs/2201.11147v1
- Date: Sun, 23 Jan 2022 14:49:49 GMT
- Title: OntoProtein: Protein Pretraining With Gene Ontology Embedding
- Authors: Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong,
Shumin Deng, Jiazhang Lian, Qiang Zhang, Huajun Chen
- Abstract summary: We propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models.
We construct a novel large-scale knowledge graph consisting of GO and its related proteins, in which every node is described by gene annotation text or a protein sequence.
- Score: 36.92674447484136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised protein language models have proved their effectiveness in
learning protein representations. With increasing computational power, current
protein language models pre-trained on millions of diverse sequences have
scaled from millions to billions of parameters and achieved remarkable
improvements. However, these prevailing approaches rarely consider
incorporating knowledge graphs (KGs), which can provide rich structured
knowledge facts for better protein representations. We argue that informative
biological knowledge in KGs can enhance protein representations with external
knowledge. In this work, we propose OntoProtein, the first general framework
that incorporates the structure of GO (Gene Ontology) into protein pre-training
models. We construct a novel large-scale knowledge graph consisting of GO and
its related proteins, in which every node is described by gene annotation text
or a protein sequence. We propose a novel contrastive learning scheme with
knowledge-aware negative sampling to jointly optimize the knowledge graph and
protein embeddings during pre-training. Experimental results show that
OntoProtein surpasses state-of-the-art pre-trained protein language models on
the TAPE benchmark and outperforms baselines on protein-protein interaction
and protein function prediction. Code and datasets are available at
https://github.com/zjunlp/OntoProtein.
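As a rough illustration of the idea, below is a minimal sketch of contrastive pre-training with knowledge-aware negative sampling, assuming a TransE-style triple score and hard negatives drawn from related GO terms. The abstract does not specify OntoProtein's exact loss, score function, or sampling strategy, so every function and variable name here is hypothetical:

```python
import torch
import torch.nn.functional as F

def transe_score(h, r, t):
    # TransE-style plausibility: a smaller ||h + r - t|| means a more
    # plausible (head, relation, tail) triple, so negate the distance.
    return -torch.norm(h + r - t, p=2, dim=-1)

def knowledge_aware_contrastive_loss(protein_emb, rel_emb, go_emb,
                                     neg_go_emb, margin=1.0):
    """Margin ranking loss over (protein, relation, GO term) triples.

    protein_emb: (B, d)    protein embeddings from a language model
    rel_emb:     (B, d)    relation embeddings (e.g. 'annotated_with')
    go_emb:      (B, d)    embeddings of the true GO terms
    neg_go_emb:  (B, K, d) knowledge-aware negatives, e.g. GO terms
                           sampled from the same ontology branch so
                           they are hard to tell apart from the positive
    """
    pos = transe_score(protein_emb, rel_emb, go_emb)            # (B,)
    neg = transe_score(protein_emb.unsqueeze(1),
                       rel_emb.unsqueeze(1), neg_go_emb)        # (B, K)
    # Require each positive triple to outscore every negative by `margin`.
    return F.relu(margin - pos.unsqueeze(1) + neg).mean()

# Toy usage: random tensors stand in for the real encoders.
B, K, d = 4, 8, 128
protein = torch.randn(B, d, requires_grad=True)
relation = torch.randn(B, d, requires_grad=True)
go_pos = torch.randn(B, d, requires_grad=True)
go_neg = torch.randn(B, K, d, requires_grad=True)
loss = knowledge_aware_contrastive_loss(protein, relation, go_pos, go_neg)
loss.backward()  # gradients flow into both KG and protein embeddings
```

In a full pipeline, protein embeddings would come from a pre-trained protein language model and GO-term embeddings from a text encoder over gene annotations, with this ranking loss presumably combined with the usual masked-language-modeling objective.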
Related papers
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) on protein sequences has seen great success in learning meaningful representations and in generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- Advanced atom-level representations for protein flexibility prediction utilizing graph neural networks [0.0]
We propose graph neural networks (GNNs) to learn protein representations at the atomic level and predict B-factors from protein 3D structures.
The Meta-GNN model achieves a correlation coefficient of 0.71 on a large and diverse test set of over 4k proteins.
arXiv Detail & Related papers (2024-08-22T16:15:13Z)
- GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning [27.192150057715835]
GOProteinGNN is a novel architecture that enhances protein language models by integrating protein knowledge graph information.
Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process.
arXiv Detail & Related papers (2024-07-31T17:54:22Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose ProtLLM, a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Integration of Pre-trained Protein Language Models into Geometric Deep Learning Networks [68.90692290665648]
We integrate knowledge learned by protein language models into several state-of-the-art geometric networks.
Our findings show an overall improvement of 20% over baselines.
Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin.
arXiv Detail & Related papers (2022-12-07T04:04:04Z)
- Multi-modal Protein Knowledge Graph Construction and Applications [30.500520131560112]
We create ProteinKG65, a knowledge graph for protein science.
Using the Gene Ontology and the UniProt knowledge base as a basis, we transform various kinds of knowledge with aligned descriptions and protein sequences.
ProteinKG65 is mainly dedicated to providing a specialized protein knowledge graph, bringing the knowledge of Gene Ontology to protein function and structure prediction.
arXiv Detail & Related papers (2022-05-27T08:18:56Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.