AugLiChem: Data Augmentation Library of Chemical Structures for Machine
Learning
- URL: http://arxiv.org/abs/2111.15112v2
- Date: Wed, 1 Dec 2021 21:04:43 GMT
- Authors: Rishikesh Magar, Yuyang Wang, Cooper Lorsung, Chen Liang, Hariharan
Ramasubramanian, Peiyuan Li and Amir Barati Farimani
- Abstract summary: AugLiChem is the data augmentation library for chemical structures.
Augmentation methods for both crystalline systems and molecules are introduced.
We show that using our augmentation strategies significantly improves the performance of ML models.
- Score: 12.864696894234715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) has demonstrated the promise for accurate and efficient
property prediction of molecules and crystalline materials. To develop highly
accurate ML models for chemical structure property prediction, datasets with
sufficient samples are required. However, obtaining clean and sufficient data
of chemical properties can be expensive and time-consuming, which greatly
limits the performance of ML models. Inspired by the success of data
augmentations in computer vision and natural language processing, we developed
AugLiChem: the data augmentation library for chemical structures. Augmentation
methods for both crystalline systems and molecules are introduced, which can be
utilized for fingerprint-based ML models and Graph Neural Networks (GNNs). We
show that using our augmentation strategies significantly improves the
performance of ML models, especially when using GNNs. In addition, the
augmentations we developed can be used as a direct plug-in module during
training and have demonstrated their effectiveness when implemented with
different GNN models through the AugLiChem library. The Python-based package
for our implementation of AugLiChem: Data Augmentation Library of Chemical
Structures, is publicly available at: https://github.com/BaratiLab/AugLiChem.
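The abstract describes augmentations that plug directly into GNN training. As a minimal illustration of one such strategy (not AugLiChem's actual API; all names here are hypothetical), the sketch below randomly masks atom labels in a molecular graph while leaving the bond connectivity intact:

```python
import random

def mask_atoms(atom_types, bonds, mask_rate=0.15, mask_token="*", seed=None):
    """Randomly replace a fraction of atom labels with a mask token.

    atom_types: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs; connectivity is left untouched,
    so the augmented graph remains a valid GNN input.
    """
    rng = random.Random(seed)
    masked = list(atom_types)
    n_mask = max(1, int(len(masked) * mask_rate))
    for i in rng.sample(range(len(masked)), n_mask):
        masked[i] = mask_token
    return masked, bonds

# Ethanol (CCO) as a toy graph: three heavy atoms, two bonds.
atoms, bonds = mask_atoms(["C", "C", "O"], [(0, 1), (1, 2)], mask_rate=0.34, seed=0)
```

Because the mask is applied only to node features, the same augmented sample can feed either a fingerprint pipeline (after re-featurization) or a GNN.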
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
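The masking idea can be sketched in a few lines. The toy function below masks a random contiguous character span of a SMILES string; the paper's method aligns masks to chemically meaningful subsequences (functional groups), which this character-level stand-in does not attempt:

```python
import random

def mask_smiles_span(smiles, span_len=2, mask_token="[MASK]", seed=None):
    """Mask a random contiguous character span of a SMILES string.

    A toy stand-in for functional-group-aware masking: a real
    implementation would align spans to chemical tokens (rings,
    branches, functional groups) rather than raw characters.
    """
    rng = random.Random(seed)
    if len(smiles) <= span_len:
        return mask_token
    start = rng.randrange(len(smiles) - span_len + 1)
    return smiles[:start] + mask_token + smiles[start + span_len:]

masked = mask_smiles_span("CC(=O)Oc1ccccc1C(=O)O", span_len=3, seed=7)  # aspirin
```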
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- Synthetic data enable experiments in atomistic machine learning [0.0]
We demonstrate the use of a large dataset labelled with per-atom energies from an existing ML potential model.
Because this labelling is far cheaper than the quantum-mechanical ground truth, we can generate millions of datapoints.
We show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets.
arXiv Detail & Related papers (2022-11-29T18:17:24Z)
- Augmenting Interpretable Models with LLMs during Training [73.40079895413861]
We propose Augmented Interpretable Models (Aug-imodels) to build efficient and interpretable models.
Aug-imodels use LLMs during fitting but not during inference, allowing complete transparency.
We explore two instantiations of Aug-imodels in natural-language processing: (i) Aug-GAM, which augments a generalized additive model with decoupled embeddings from an LLM and (ii) Aug-Tree, which augments a decision tree with LLM feature expansions.
arXiv Detail & Related papers (2022-09-23T18:36:01Z)
- MolGraph: a Python package for the implementation of molecular graphs and graph neural networks with TensorFlow and Keras [51.92255321684027]
MolGraph is a graph neural network (GNN) package for molecular machine learning (ML).
MolGraph implements a chemistry module to accommodate the generation of small molecular graphs, which can be passed to a GNN algorithm to solve a molecular ML problem.
GNNs proved useful for molecular identification and improved interpretability of chromatographic retention time data.
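As a rough picture of what a chemistry module does, the toy converter below turns a linear-chain SMILES into node and edge lists. This is not MolGraph's API; real conversion (rings, branches, bond orders, aromaticity) requires full cheminformatics handling:

```python
def chain_smiles_to_graph(smiles):
    """Convert a *linear-chain* SMILES (single-letter atoms, single
    bonds only, e.g. "CCO") into a node list and an edge list.

    A toy illustration of SMILES-to-graph conversion only.
    """
    atoms = [ch for ch in smiles if ch.isalpha()]
    # Consecutive atoms in a chain are bonded; store each edge both
    # ways so the adjacency is symmetric, as GNN libraries expect.
    edges = []
    for i in range(len(atoms) - 1):
        edges.append((i, i + 1))
        edges.append((i + 1, i))
    return atoms, edges

atoms, edges = chain_smiles_to_graph("CCO")  # ethanol
```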
arXiv Detail & Related papers (2022-08-21T18:37:41Z)
- Crystal Twins: Self-supervised Learning for Crystalline Material Property Prediction [8.048439531116367]
We introduce Crystal Twins (CT): an SSL method for crystalline materials property prediction.
We pre-train a Graph Neural Network (GNN) by applying the redundancy reduction principle to the graph latent embeddings of augmented instances.
By sharing the pre-trained weights when fine-tuning the GNN for regression tasks, we significantly improve the performance for 7 challenging material property prediction benchmarks.
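The redundancy reduction principle can be illustrated with a Barlow Twins-style objective: standardize each embedding dimension, form the cross-correlation matrix between the two augmented views, and drive its diagonal toward 1 (invariance) and its off-diagonal toward 0 (decorrelation). A pure-Python sketch of this style of objective, not Crystal Twins' exact implementation:

```python
def redundancy_reduction_loss(z1, z2, lam=0.005):
    """Barlow Twins-style loss on two batches of embeddings (lists of
    equal-length vectors from two augmented views of the same inputs).
    """
    n, d = len(z1), len(z1[0])

    def standardize(z):
        # Zero-mean, unit-variance per embedding dimension.
        cols = []
        for j in range(d):
            col = [row[j] for row in z]
            mu = sum(col) / n
            sd = (sum((x - mu) ** 2 for x in col) / n) ** 0.5 or 1.0
            cols.append([(x - mu) / sd for x in col])
        return cols  # d lists of length n

    a, b = standardize(z1), standardize(z2)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[i][k] * b[j][k] for k in range(n)) / n
            loss += (c_ij - 1.0) ** 2 if i == j else lam * c_ij ** 2
    return loss

# Identical views with decorrelated dimensions give (near-)zero loss.
views = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss = redundancy_reduction_loss(views, views)
```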
arXiv Detail & Related papers (2022-05-04T05:08:46Z)
- Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist learning molecule representation.
Our approach is proven effective in (1) keeping the embedding space well-organized and (2) improving the generalization ability of molecule embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
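One way to encode a reaction-based constraint, sketched here under the assumption that the method relates the summed reactant embeddings to the summed product embeddings (not necessarily the paper's exact objective):

```python
def reaction_consistency_loss(reactant_embs, product_embs):
    """Squared distance between the summed reactant embeddings and the
    summed product embeddings of one reaction. Minimizing it treats a
    reaction as preserving "total" embedding content, which organizes
    the embedding space around chemical transformations.
    """
    d = len(reactant_embs[0])
    r = [sum(e[j] for e in reactant_embs) for j in range(d)]
    p = [sum(e[j] for e in product_embs) for j in range(d)]
    return sum((r[j] - p[j]) ** 2 for j in range(d))

# Made-up 2D embeddings: two reactants whose sum matches the product.
loss = reaction_consistency_loss([[1.0, 2.0], [3.0, 4.0]], [[4.0, 6.0]])
```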
arXiv Detail & Related papers (2021-09-21T00:08:43Z)
- DGL-LifeSci: An Open-Source Toolkit for Deep Learning on Graphs in Life Science [5.3825788156200565]
We present DGL-LifeSci, an open-source package for deep learning on graphs in life science.
DGL-LifeSci is a python toolkit based on RDKit, PyTorch and Deep Graph Library.
It allows GNN-based modeling on custom datasets for molecular property prediction, reaction prediction and molecule generation.
arXiv Detail & Related papers (2021-06-27T13:27:47Z)
- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models using a router method that learns to assign each input example to either type of model, which provides a substantial boost in performance.
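The router idea reduces to a learned gate that decides, per example, which model should score it. A toy sketch with illustrative names (none are from the paper):

```python
def route(example, lm_score, kg_score, gate):
    """Send each input to the LM-based or KG-embedding model according
    to a gate. Here the gate is any callable returning the probability
    that the LM model is the better scorer; a real router would be
    trained on held-out link-prediction accuracy.
    """
    use_lm = gate(example) >= 0.5
    return ("lm", lm_score(example)) if use_lm else ("kg", kg_score(example))

# Toy usage: route long textual triples to the LM, short ones to the KG model.
choice, score = route(
    "aspirin -- treats -- headache",
    lm_score=len,                      # stand-in scorers
    kg_score=lambda s: -len(s),
    gate=lambda s: 1.0 if len(s) > 20 else 0.0,
)
```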
arXiv Detail & Related papers (2021-06-17T17:55:33Z)
- A Universal Framework for Featurization of Atomistic Systems [0.0]
Reactive force fields based on physics or machine learning can be used to bridge the gap in time and length scales.
We introduce the Gaussian multi-pole (GMP) featurization scheme that utilizes physically-relevant multi-pole expansions of the electron density around atoms.
We demonstrate that GMP-based models can achieve chemical accuracy for the QM9 dataset, and their accuracy remains reasonable even when extrapolating to new elements.
arXiv Detail & Related papers (2021-02-04T03:11:00Z)
- ML4Chem: A Machine Learning Package for Chemistry and Materials Science [0.0]
ML4Chem is an open-source machine learning library for chemistry and materials science.
It provides an extendable platform to develop and deploy machine learning models and pipelines.
Here we introduce its atomistic module for the implementation, deployment, and inference of atomistic ML models.
arXiv Detail & Related papers (2020-03-02T00:28:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.