Self-Supervised Graph Transformer on Large-Scale Molecular Data
- URL: http://arxiv.org/abs/2007.02835v2
- Date: Thu, 29 Oct 2020 03:46:04 GMT
- Title: Self-Supervised Graph Transformer on Large-Scale Molecular Data
- Authors: Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing
Huang, Junzhou Huang
- Abstract summary: We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
- Score: 73.3448373618865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to obtain informative representations of molecules is a crucial
prerequisite in AI-driven drug design and discovery. Recent research abstracts
molecules as graphs and employs Graph Neural Networks (GNNs) for molecular
representation learning. Nevertheless, two issues impede the usage of GNNs in
real scenarios: (1) insufficient labeled molecules for supervised training; (2)
poor generalization capability to newly synthesized molecules. To address both,
we propose a novel framework, GROVER, which stands for Graph Representation
frOm self-superVised mEssage passing tRansformer. With carefully designed
self-supervised tasks at the node, edge, and graph levels, GROVER can learn
rich structural and semantic information of molecules from enormous unlabelled
molecular data. To encode such complex information, GROVER integrates Message
Passing Networks into the Transformer-style architecture to deliver a class of
more expressive molecular encoders. The flexibility of GROVER allows it to be
trained efficiently on large-scale molecular datasets without requiring any
supervision, thus sidestepping the two issues mentioned above. We pre-train
GROVER with 100 million parameters on 10 million unlabelled molecules -- the
biggest GNN and the largest training dataset in molecular representation
learning. We then fine-tune the pre-trained GROVER on task-specific data for
molecular property prediction, where we observe a large improvement (more than
6% on average) over current state-of-the-art methods on 11 challenging
benchmarks. Our key insight is that well-designed self-supervision losses and
highly expressive pre-trained models hold significant potential for boosting
performance.
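To picture the hybrid encoder the abstract describes, here is a minimal PyTorch sketch, assuming a single message-passing step whose output node states become the queries, keys, and values of a self-attention layer. Class and parameter names (`MPNNAttentionBlock`, `dim`, `num_heads`) are illustrative and do not reproduce GROVER's actual implementation.

```python
import torch
import torch.nn as nn

class MPNNAttentionBlock(nn.Module):
    """Hypothetical hybrid block: one message-passing step whose outputs
    feed Transformer-style multi-head self-attention over the nodes."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.message = nn.Linear(dim, dim)                    # neighbor transform
        self.update = nn.GRUCell(dim, dim)                    # node-state update
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1.
        msgs = adj @ self.message(h)                          # sum neighbor messages
        h = self.update(msgs, h)                              # absorb local structure
        # Global self-attention over message-passing outputs: every atom can
        # now attend to every other atom, not just its bonded neighbors.
        out, _ = self.attn(h.unsqueeze(0), h.unsqueeze(0), h.unsqueeze(0))
        return out.squeeze(0)

# Toy usage: a 5-atom chain molecule with random features.
h = torch.randn(5, 32)
adj = torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
out = MPNNAttentionBlock(32)(h, adj)                          # (5, 32)
```

The design point is visible even at this scale: attention operates on node states that already encode local bond structure, combining the GNN's connectivity bias with the Transformer's global receptive field.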
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
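As a rough illustration of this masking objective, the sketch below replaces random atom tokens in a SMILES string with a `[MASK]` placeholder; the character-level tokenizer and every name here are simplifying assumptions, not the paper's actual procedure.

```python
import random
import re

MASK = "[MASK]"
# Naive SMILES tokenizer: bracket atoms first, then two-letter halogens,
# then any single character (bonds, ring digits, parentheses, atoms).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def mask_smiles(smiles: str, mask_rate: float = 0.15, seed: int = 0) -> str:
    """Randomly replace atom tokens in a SMILES string with [MASK]."""
    rng = random.Random(seed)
    out = []
    for tok in TOKEN_RE.findall(smiles):
        is_atom = tok.startswith("[") or tok[0].isalpha()
        out.append(MASK if is_atom and rng.random() < mask_rate else tok)
    return "".join(out)

print(mask_smiles("CC(=O)Oc1ccccc1C(=O)O", mask_rate=0.3))  # aspirin, partially masked
```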
- Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [55.42602325017405]
We propose a novel method called GODE, which takes into account the two-level structure of individual molecules.
GODE pre-trains two graph neural networks (GNNs) on different graph structures and, via contrastive learning, fuses molecular structures with their corresponding knowledge graph substructures.
When fine-tuned across 11 chemical property tasks, our model outperforms existing baselines, with an average ROC-AUC improvement of 13.8% on classification tasks and an average RMSE/MAE improvement of 35.1% on regression tasks.
arXiv Detail & Related papers (2023-06-02T15:49:45Z)
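The cross-view contrastive step can be pictured with a standard InfoNCE loss that pulls a molecule's graph embedding toward the embedding of its knowledge-graph substructure and pushes it away from other molecules in the batch; this is a generic formulation, not GODE's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(mol_emb: torch.Tensor, kg_emb: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Generic InfoNCE: row i of mol_emb is the positive for row i of kg_emb.
    mol_emb, kg_emb: (batch, dim) outputs of the two pre-trained GNNs."""
    mol = F.normalize(mol_emb, dim=-1)
    kg = F.normalize(kg_emb, dim=-1)
    logits = mol @ kg.t() / tau                   # (batch, batch) cosine similarities
    targets = torch.arange(mol.size(0))           # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```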
- GraphGANFed: A Federated Generative Framework for Graph-Structured Molecules Towards Efficient Drug Discovery [2.309914459672556]
We propose the Graph convolutional network in Generative Adversarial Networks via Federated learning (GraphGANFed) framework to generate novel molecules without sharing local datasets.
The molecules generated by GraphGANFed achieve high novelty (= 100) and diversity (> 0.9).
arXiv Detail & Related papers (2023-04-11T21:15:28Z)
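The "without sharing local data" property typically comes from exchanging only model weights; the sketch below shows plain federated averaging of parameter dictionaries, a generic pattern assumed here rather than GraphGANFed's specific aggregation rule.

```python
from collections import OrderedDict
import torch

def federated_average(client_states: list) -> OrderedDict:
    """FedAvg-style aggregation: average each parameter tensor across clients.
    Only state_dicts travel to the server; molecule data stays local."""
    avg = OrderedDict()
    for name in client_states[0]:
        avg[name] = torch.stack([s[name].float() for s in client_states]).mean(dim=0)
    return avg

# Usage sketch: each client trains its local GAN, then contributes weights.
# global_gen = federated_average([c.generator.state_dict() for c in clients])
```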
- MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT).
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction.
arXiv Detail & Related papers (2022-12-20T19:32:30Z)
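Continuous prompt tuning is easiest to picture as a handful of trainable vectors injected alongside the frozen encoder's input; the sketch below prepends learnable "virtual node" prompts to the node features, a generic rendering of the idea rather than MolCPT's actual motif prompting function.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Generic continuous prompt tuning: the pre-trained backbone is frozen
    and only a few prompt vectors (extra virtual nodes) are trained."""

    def __init__(self, backbone: nn.Module, dim: int, num_prompts: int = 4):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                  # keep pre-trained weights fixed
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, dim). Prepend trainable prompt tokens so the
        # frozen backbone sees the molecule plus a learned, task-specific context.
        x = torch.cat([self.prompts, node_feats], dim=0)
        return self.backbone(x)
```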
- BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation [21.03650456372902]
We propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations.
BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges.
It achieves state-of-the-art results on multiple drug discovery tasks, including molecular property prediction, drug-drug interaction, and drug-target interaction.
arXiv Detail & Related papers (2022-11-25T09:44:28Z)
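The masked-reconstruction idea reduces to hiding a random subset of node features and scoring their reconstruction; this single-branch MSE version is a simplification, since BatmanNet's bi-branch design reconstructs missing edges as well as nodes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_node_loss(encoder: nn.Module, decoder: nn.Module,
                     x: torch.Tensor, mask_rate: float = 0.5) -> torch.Tensor:
    """Hide random node features, encode the corrupted graph, and score
    reconstruction of the hidden features. x: (num_nodes, feat_dim)."""
    mask = torch.rand(x.size(0)) < mask_rate       # True = node is hidden
    corrupted = x.clone()
    corrupted[mask] = 0.0                           # zero out masked node features
    recon = decoder(encoder(corrupted))             # same shape as x
    return F.mse_loss(recon[mask], x[mask])
```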
- Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist in learning molecule representations.
Our approach is proven effective at 1) keeping the embedding space well-organized and 2) improving the generalization ability of molecule embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
arXiv Detail & Related papers (2021-09-21T00:08:43Z)
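One way a reaction can organize the embedding space, offered here as an assumption about the mechanism rather than the paper's verbatim objective, is to require the summed embeddings of a reaction's reactants to stay close to the summed embeddings of its products:

```python
import torch

def reaction_consistency_loss(reactant_embs: torch.Tensor,
                              product_embs: torch.Tensor) -> torch.Tensor:
    """Assumed objective: treat a reaction as an equivalence and pull the
    aggregate reactant embedding toward the aggregate product embedding.
    reactant_embs: (num_reactants, dim); product_embs: (num_products, dim)."""
    return (reactant_embs.sum(dim=0) - product_embs.sum(dim=0)).pow(2).mean()
```

Under such a constraint, molecules that participate in similar chemistry end up geometrically related, which is one concrete reading of "keeping the embedding space well-organized."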
- Learning Attributed Graph Representations with Communicative Message Passing Transformer [3.812358821429274]
We propose a Communicative Message Passing Transformer (CoMPT) neural network to improve the molecular graph representation.
Unlike the previous transformer-style GNNs that treat molecules as fully connected graphs, we introduce a message diffusion mechanism to leverage the graph connectivity inductive bias.
arXiv Detail & Related papers (2021-07-19T11:58:32Z)
- Learn molecular representations from large-scale unlabeled molecules for drug discovery [19.222413268610808]
The Molecular Pre-training Graph-based deep learning framework, named MPG, learns molecular representations from large-scale unlabeled molecules.
MolGNet can capture valuable chemistry insights to produce interpretable representations.
MPG shows promise as a novel approach in the drug discovery pipeline.
arXiv Detail & Related papers (2020-12-21T08:21:49Z)
- ASGN: An Active Semi-supervised Graph Neural Network for Molecular Property Prediction [61.33144688400446]
We propose a novel framework called Active Semi-supervised Graph Neural Network (ASGN) that incorporates both labeled and unlabeled molecules.
In the teacher model, we propose a novel semi-supervised learning method to learn a general representation that jointly exploits information from molecular structure and molecular distribution.
Finally, we propose a novel active learning strategy based on molecular diversity to select informative data throughout framework learning.
arXiv Detail & Related papers (2020-07-07T04:22:39Z)
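Diversity-driven acquisition of this kind is commonly implemented as greedy max-min (k-center) sampling in embedding space; the sketch below shows that generic strategy and should not be read as ASGN's exact selection criterion.

```python
import torch

def select_diverse(embs: torch.Tensor, k: int) -> list:
    """Greedy max-min (k-center) selection: repeatedly pick the unlabeled
    molecule farthest from everything already selected.
    embs: (num_molecules, dim) embeddings; returns k selected indices."""
    chosen = [0]                                    # seed with an arbitrary point
    min_dist = torch.cdist(embs, embs[chosen]).squeeze(1)
    for _ in range(k - 1):
        idx = int(min_dist.argmax())                # farthest from the chosen set
        chosen.append(idx)
        d_new = (embs - embs[idx]).norm(dim=1)      # distances to the new pick
        min_dist = torch.minimum(min_dist, d_new)
    return chosen

picked = select_diverse(torch.randn(100, 64), k=10)
```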
This list is automatically generated from the titles and abstracts of the papers in this site.