Dual-view Molecule Pre-training
- URL: http://arxiv.org/abs/2106.10234v1
- Date: Thu, 17 Jun 2021 03:58:38 GMT
- Title: Dual-view Molecule Pre-training
- Authors: Jinhua Zhu, Yingce Xia, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu
- Abstract summary: Dual-view molecule pre-training can effectively combine the strengths of both types of molecule representations.
DMP is tested on nine molecular property prediction tasks and achieves state-of-the-art performance on seven of them.
- Score: 186.07333992384287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by its success in natural language processing and computer vision,
pre-training has attracted substantial attention in cheminformatics and
bioinformatics, especially for molecule based tasks. A molecule can be
represented by either a graph (where atoms are connected by bonds) or a SMILES
sequence (where depth-first-search is applied to the molecular graph with
specific rules). Existing works on molecule pre-training use either graph
representations only or SMILES representations only. In this work, we propose
to leverage both the representations and design a new pre-training algorithm,
dual-view molecule pre-training (briefly, DMP), that can effectively combine
the strengths of both types of molecule representations. The model of DMP
consists of two branches: a Transformer branch that takes the SMILES sequence
of a molecule as input, and a GNN branch that takes a molecular graph as input.
The training of DMP contains three tasks: (1) predicting masked tokens in a
SMILES sequence by the Transformer branch, (2) predicting masked atoms in a
molecular graph by the GNN branch, and (3) maximizing the consistency between
the two high-level representations output by the Transformer and GNN branches
separately. After pre-training, we can use either the Transformer branch (this
one is recommended according to empirical results), the GNN branch, or both for
downstream tasks. DMP is tested on nine molecular property prediction tasks and
achieves state-of-the-art performance on seven of them. Furthermore, we test
DMP on three retrosynthesis tasks and achieve state-of-the-art results on the
USPTO-full dataset. Our code will be released soon.
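To make the described setup concrete, below is a minimal PyTorch-style sketch of the two branches and the three pre-training objectives. The module sizes, the featurization, and the cosine-based consistency term are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the DMP two-branch pre-training setup (illustrative only).
# Tokenization, featurization, dimensions, and the exact consistency loss are
# assumptions; the abstract does not pin these details down.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmilesBranch(nn.Module):
    """Transformer branch: masked-token prediction over SMILES token ids."""
    def __init__(self, vocab_size=128, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                      # token_ids: (B, L)
        h = self.encoder(self.embed(token_ids))        # (B, L, d_model)
        return h, self.lm_head(h)

class GraphBranch(nn.Module):
    """Tiny GNN branch: masked-atom prediction over the molecular graph."""
    def __init__(self, n_atom_types=64, d_model=256, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(n_atom_types, d_model)
        self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.atom_head = nn.Linear(d_model, n_atom_types)

    def forward(self, atom_ids, adj):                  # (B, N), (B, N, N) row-normalised adjacency
        h = self.embed(atom_ids)
        for lin in self.layers:
            h = F.relu(lin(torch.bmm(adj, h)))         # aggregate neighbour messages
        return h, self.atom_head(h)

def dmp_loss(smiles_branch, graph_branch, batch, lam=0.1):
    """Sum of the three pre-training objectives listed in the abstract."""
    h_s, tok_logits = smiles_branch(batch["masked_tokens"])
    h_g, atom_logits = graph_branch(batch["masked_atoms"], batch["adj"])

    # (1) masked SMILES tokens and (2) masked atoms; -100 marks unmasked positions
    l_tok = F.cross_entropy(tok_logits.transpose(1, 2), batch["token_targets"], ignore_index=-100)
    l_atom = F.cross_entropy(atom_logits.transpose(1, 2), batch["atom_targets"], ignore_index=-100)

    # (3) consistency between the two molecule-level representations
    # (mean pooling and a cosine term are assumptions, not the paper's exact loss)
    z_s, z_g = h_s.mean(dim=1), h_g.mean(dim=1)
    l_cons = 1.0 - F.cosine_similarity(z_s, z_g, dim=-1).mean()
    return l_tok + l_atom + lam * l_cons
```

After pre-training with a loss of this shape, either branch can be fine-tuned on its own, which is consistent with the abstract's recommendation to use the Transformer branch for downstream tasks.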
Related papers
- MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning [17.93173928602627]
We propose a simple transformer-based baseline for multimodal molecular representation learning.
We integrate three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules.
Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets.
arXiv Detail & Related papers (2024-10-10T14:36:58Z)
- Molecular Property Prediction Based on Graph Structure Learning [29.516479802217205]
We propose a graph structure learning (GSL) based MPP approach, called GSL-MPP.
Specifically, we first apply graph neural network (GNN) over molecular graphs to extract molecular representations.
With molecular fingerprints, we construct a molecular similarity graph (MSG)
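As a rough illustration of how a fingerprint-based molecular similarity graph can be built, the sketch below uses RDKit Morgan fingerprints and a Tanimoto-similarity threshold; the fingerprint type and threshold are assumptions, not necessarily GSL-MPP's exact recipe.

```python
# Illustrative construction of a molecular similarity graph (MSG) from
# fingerprints; fingerprint choice and threshold are assumptions.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similarity_graph(smiles_list, threshold=0.6):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    edges = []
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
            if sim >= threshold:          # connect molecules that are similar enough
                edges.append((i, j, sim))
    return edges                          # weighted edges of the MSG

print(similarity_graph(["CCO", "CCN", "c1ccccc1"]))
```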
arXiv Detail & Related papers (2023-12-28T06:45:13Z)
- Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules [81.05116895430375]
Masked graph modeling excels in the self-supervised representation learning of molecular graphs.
We show that a subgraph-level tokenizer and a sufficiently expressive decoder with remask decoding have a large impact on the encoder's representation learning.
We propose a novel MGM method SimSGT, featuring a Simple GNN-based Tokenizer (SGT) and an effective decoding strategy.
arXiv Detail & Related papers (2023-10-23T09:40:30Z)
- Geometry-aware Line Graph Transformer Pre-training for Molecular Property Prediction [4.598522704308923]
Geometry-aware line graph transformer (Galformer) pre-training is a novel self-supervised learning framework.
Galformer consistently outperforms all baselines on both classification and regression tasks.
arXiv Detail & Related papers (2023-09-01T14:20:48Z)
- Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [55.42602325017405]
We propose a novel method called GODE, which takes into account the two-level structure of individual molecules.
By pre-training two graph neural networks (GNNs) on different graph structures, combined with contrastive learning, GODE fuses molecular structures with their corresponding knowledge graph substructures.
When fine-tuned across 11 chemical property tasks, our model outperforms existing benchmarks, registering an average ROC-AUC uplift of 13.8% for classification tasks and an average RMSE/MAE enhancement of 35.1% for regression tasks.
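As a generic sketch of the contrastive ingredient described here, an InfoNCE-style loss can pull a molecule's GNN embedding toward the embedding of its knowledge-graph substructure; the symmetric form and temperature below are placeholders rather than GODE's actual objective.

```python
# Generic InfoNCE contrastive loss between paired molecule / knowledge-graph
# embeddings; a stand-in for a bi-level contrastive objective, not GODE's code.
import torch
import torch.nn.functional as F

def info_nce(z_mol, z_kg, temperature=0.1):
    """z_mol, z_kg: (B, d) embeddings of the same molecules from two views."""
    z_mol = F.normalize(z_mol, dim=-1)
    z_kg = F.normalize(z_kg, dim=-1)
    logits = z_mol @ z_kg.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z_mol.size(0), device=z_mol.device)
    # matched pairs on the diagonal are positives, everything else negatives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```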
arXiv Detail & Related papers (2023-06-02T15:49:45Z)
- BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation [21.03650456372902]
We propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations.
BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges.
It achieves state-of-the-art results for multiple drug discovery tasks, including molecular properties prediction, drug-drug interaction, and drug-target interaction.
arXiv Detail & Related papers (2022-11-25T09:44:28Z)
- One Transformer Can Understand Both 2D & 3D Molecular Data [94.93514673086631]
We develop a novel Transformer-based Molecular model called Transformer-M.
It can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations.
All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks.
arXiv Detail & Related papers (2022-10-04T17:30:31Z)
- Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist learning molecule representation.
Our approach proves effective in 1) keeping the embedding space well-organized and 2) improving the generalization ability of molecule embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
arXiv Detail & Related papers (2021-09-21T00:08:43Z)
- Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)