Data-Efficient Graph Grammar Learning for Molecular Generation
- URL: http://arxiv.org/abs/2203.08031v1
- Date: Tue, 15 Mar 2022 16:14:30 GMT
- Title: Data-Efficient Graph Grammar Learning for Molecular Generation
- Authors: Minghao Guo, Veronika Thost, Beichen Li, Payel Das, Jie Chen, Wojciech
Matusik
- Abstract summary: We propose a data-efficient generative model that can be learned from datasets orders of magnitude smaller than common benchmarks.
Our learned graph grammar yields state-of-the-art results on generating high-quality molecules for three monomer datasets.
Our approach also achieves remarkable performance in a challenging polymer generation task with only $117$ training samples.
- Score: 41.936515793383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of molecular generation has received significant attention
recently. Existing methods are typically based on deep neural networks and
require training on large datasets with tens of thousands of samples. In
practice, however, the size of class-specific chemical datasets is usually
limited (e.g., dozens of samples) due to labor-intensive experimentation and
data collection. This presents a considerable challenge for the deep learning
generative models to comprehensively describe the molecular design space.
Another major challenge is to generate only physically synthesizable molecules.
This is a non-trivial task for neural network-based generative models since the
relevant chemical knowledge can only be extracted and generalized from the
limited training data. In this work, we propose a data-efficient generative
model that can be learned from datasets with orders of magnitude smaller sizes
than common benchmarks. At the heart of this method is a learnable graph
grammar that generates molecules from a sequence of production rules. Without
any human assistance, these production rules are automatically constructed from
training data. Furthermore, additional chemical knowledge can be incorporated
in the model by further grammar optimization. Our learned graph grammar yields
state-of-the-art results on generating high-quality molecules for three monomer
datasets that contain only ${\sim}20$ samples each. Our approach also achieves
remarkable performance in a challenging polymer generation task with only $117$
training samples and is competitive against existing methods using $81$k data
points. Code is available at https://github.com/gmh14/data_efficient_grammar.
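The abstract describes generation as applying a sequence of learned production rules. As a rough illustration only (the paper learns hypergraph production rules over molecular graphs; the rules and symbols below are invented for this sketch), a grammar-driven generator can be reduced to recursively expanding nonterminals until only terminal tokens remain:

```python
import random

# Hypothetical toy grammar illustrating rule-based generation. Terminals are
# SMILES atom tokens; the paper's actual rules operate on graphs and are
# learned automatically from training data, not hand-written like these.
RULES = {
    "MOL":   [["CHAIN"], ["CHAIN", "O", "CHAIN"]],  # plain chain or ether linkage
    "CHAIN": [["C", "CHAIN"], ["C"]],               # extend or terminate a carbon chain
}

def generate(symbol="MOL", rng=None):
    """Expand nonterminals recursively; any symbol not in RULES is a terminal."""
    rng = rng or random.Random(0)
    if symbol not in RULES:
        return symbol
    production = rng.choice(RULES[symbol])  # one rule application per expansion
    return "".join(generate(s, rng) for s in production)
```

Every string this toy grammar derives is a valid alkane/ether SMILES, which mirrors the paper's point: syntactic validity is guaranteed by construction, because each production rule preserves chemical well-formedness.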
Related papers
- Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model [22.368332915420606]
Unified Text-Graph Diffusion Model (UTGDiff) is a framework to generate molecular graphs from instructions.
UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models.
Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing.
arXiv Detail & Related papers (2024-08-19T11:09:15Z)
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- Hierarchical Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction [37.443491843178315]
We propose a data-efficient property predictor by utilizing a learnable hierarchical molecular grammar.
The property prediction is performed using graph neural diffusion over the grammar-induced geometry.
We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data.
arXiv Detail & Related papers (2023-09-04T19:59:51Z)
- MolGrapher: Graph-based Visual Recognition of Chemical Structures [50.13749978547401]
We introduce MolGrapher to recognize chemical structures visually.
We treat all candidate atoms and bonds as nodes and put them in a graph.
We classify atom and bond nodes in the graph with a Graph Neural Network.
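The MolGrapher summary describes a graph in which candidate atoms and candidate bonds are both nodes, later classified by a GNN. A minimal sketch of that node layout, assuming invented example data (the real system detects candidates from images and uses learned classifiers):

```python
# Toy illustration of a graph whose nodes are both atom candidates and bond
# candidates; each bond node connects to its two endpoint atom nodes.
# The atoms/bonds below are made up for demonstration.
atoms = ["C", "C", "O"]          # detected atom candidates (labels)
bonds = [(0, 1), (1, 2)]         # detected bond candidates (atom index pairs)

# Node list: atom nodes first, then bond nodes.
nodes = [("atom", a) for a in atoms] + [("bond", b) for b in bonds]

# Edges link each bond node to the atom nodes it joins.
edges = []
for j, (u, v) in enumerate(bonds):
    bond_node = len(atoms) + j   # bond nodes are indexed after atom nodes
    edges.append((bond_node, u))
    edges.append((bond_node, v))
```

Putting bonds into the node set (rather than as edges) is what lets a single GNN classify atom types and bond types uniformly, as the summary states.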
arXiv Detail & Related papers (2023-08-23T16:16:11Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- Keeping it Simple: Language Models can learn Complex Molecular Distributions [0.0]
We introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules.
The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions.
arXiv Detail & Related papers (2021-12-06T13:40:58Z)
- Learn molecular representations from large-scale unlabeled molecules for drug discovery [19.222413268610808]
The Molecular Pre-training Graph-based deep learning framework (MPG) learns molecular representations from large-scale unlabeled molecules.
MolGNet can capture valuable chemistry insights to produce interpretable representation.
MPG is promising to become a novel approach in the drug discovery pipeline.
arXiv Detail & Related papers (2020-12-21T08:21:49Z)
- Advanced Graph and Sequence Neural Networks for Molecular Property Prediction and Drug Discovery [53.00288162642151]
We develop MoleculeKit, a suite of comprehensive machine learning tools spanning different computational models and molecular representations.
Built on these representations, MoleculeKit includes both deep learning and traditional machine learning methods for graph and sequence data.
Results on both online and offline antibiotics discovery and molecular property prediction tasks show that MoleculeKit achieves consistent improvements over prior methods.
arXiv Detail & Related papers (2020-12-02T02:09:31Z)
- Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.