AugLiChem: Data Augmentation Library of Chemical Structures for Machine
Learning
- URL: http://arxiv.org/abs/2111.15112v2
- Date: Wed, 1 Dec 2021 21:04:43 GMT
- Authors: Rishikesh Magar, Yuyang Wang, Cooper Lorsung, Chen Liang, Hariharan
Ramasubramanian, Peiyuan Li and Amir Barati Farimani
- Abstract summary: AugLiChem is the data augmentation library for chemical structures.
Augmentation methods for both crystalline systems and molecules are introduced.
We show that using our augmentation strategies significantly improves the performance of ML models.
- Score: 12.864696894234715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) has demonstrated the promise for accurate and efficient
property prediction of molecules and crystalline materials. To develop highly
accurate ML models for chemical structure property prediction, datasets with
sufficient samples are required. However, obtaining clean and sufficient data
of chemical properties can be expensive and time-consuming, which greatly
limits the performance of ML models. Inspired by the success of data
augmentations in computer vision and natural language processing, we developed
AugLiChem: the data augmentation library for chemical structures. Augmentation
methods for both crystalline systems and molecules are introduced, which can be
utilized for fingerprint-based ML models and Graph Neural Networks (GNNs). We
show that using our augmentation strategies significantly improves the
performance of ML models, especially when using GNNs. In addition, the
augmentations we developed can be used as a direct plug-in module during
training and have demonstrated their effectiveness when implemented with
different GNN models through the AugLiChem library. The Python-based package
for our implementation of AugLiChem: Data Augmentation Library of Chemical
Structures, is publicly available at: https://github.com/BaratiLab/AugLiChem.
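The abstract describes augmentations that plug directly into GNN training. As a minimal illustration of one such strategy (not AugLiChem's actual API; all names here are hypothetical), the sketch below randomly masks atom labels in a molecular graph while leaving the bond connectivity intact:

```python
import random

def mask_atoms(atom_types, bonds, mask_rate=0.15, mask_token="*", seed=None):
    """Randomly replace a fraction of atom labels with a mask token.

    atom_types: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs; connectivity is left untouched,
    so the augmented graph remains a valid GNN input.
    """
    rng = random.Random(seed)
    masked = list(atom_types)
    n_mask = max(1, int(len(masked) * mask_rate))
    for i in rng.sample(range(len(masked)), n_mask):
        masked[i] = mask_token
    return masked, bonds

# Ethanol (CCO) as a toy graph: three heavy atoms, two bonds.
atoms, bonds = mask_atoms(["C", "C", "O"], [(0, 1), (1, 2)], mask_rate=0.34, seed=0)
```

Because the mask is applied only to node features, the same augmented sample can feed either a fingerprint pipeline (after re-featurization) or a GNN.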
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
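The masking idea can be sketched in a few lines. The toy function below masks a random contiguous character span of a SMILES string; the paper's method aligns masks to chemically meaningful subsequences (functional groups), which this character-level stand-in does not attempt:

```python
import random

def mask_smiles_span(smiles, span_len=2, mask_token="[MASK]", seed=None):
    """Mask a random contiguous character span of a SMILES string.

    A toy stand-in for functional-group-aware masking: a real
    implementation would align spans to chemical tokens (rings,
    branches, functional groups) rather than raw characters.
    """
    rng = random.Random(seed)
    if len(smiles) <= span_len:
        return mask_token
    start = rng.randrange(len(smiles) - span_len + 1)
    return smiles[:start] + mask_token + smiles[start + span_len:]

masked = mask_smiles_span("CC(=O)Oc1ccccc1C(=O)O", span_len=3, seed=7)  # aspirin
```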
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- Synthetic data enable experiments in atomistic machine learning [0.0]
We demonstrate the use of a large dataset labelled with per-atom energies from an existing ML potential model.
Because this labelling is far cheaper than the quantum-mechanical ground truth, we can generate millions of datapoints.
We show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets.
arXiv Detail & Related papers (2022-11-29T18:17:24Z)
- Augmenting Interpretable Models with LLMs during Training [73.40079895413861]
We propose Augmented Interpretable Models (Aug-imodels) to build efficient and interpretable models.
Aug-imodels use LLMs during fitting but not during inference, allowing complete transparency.
We explore two instantiations of Aug-imodels in natural-language processing: (i) Aug-GAM, which augments a generalized additive model with decoupled embeddings from an LLM and (ii) Aug-Tree, which augments a decision tree with LLM feature expansions.
arXiv Detail & Related papers (2022-09-23T18:36:01Z)
- MolGraph: a Python package for the implementation of molecular graphs and graph neural networks with TensorFlow and Keras [51.92255321684027]
MolGraph is a graph neural network (GNN) package for molecular machine learning (ML).
MolGraph implements a chemistry module to accommodate the generation of small molecular graphs, which can be passed to a GNN algorithm to solve a molecular ML problem.
GNNs proved useful for molecular identification and improved interpretability of chromatographic retention time data.
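As a rough picture of what a chemistry module does, the toy converter below turns a linear-chain SMILES into node and edge lists. This is not MolGraph's API; real conversion (rings, branches, bond orders, aromaticity) requires full cheminformatics handling:

```python
def chain_smiles_to_graph(smiles):
    """Convert a *linear-chain* SMILES (single-letter atoms, single
    bonds only, e.g. "CCO") into a node list and an edge list.

    A toy illustration of SMILES-to-graph conversion only.
    """
    atoms = [ch for ch in smiles if ch.isalpha()]
    # Consecutive atoms in a chain are bonded; store each edge both
    # ways so the adjacency is symmetric, as GNN libraries expect.
    edges = []
    for i in range(len(atoms) - 1):
        edges.append((i, i + 1))
        edges.append((i + 1, i))
    return atoms, edges

atoms, edges = chain_smiles_to_graph("CCO")  # ethanol
```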
arXiv Detail & Related papers (2022-08-21T18:37:41Z)
- Crystal Twins: Self-supervised Learning for Crystalline Material Property Prediction [8.048439531116367]
We introduce Crystal Twins (CT): an SSL method for crystalline materials property prediction.
We pre-train a Graph Neural Network (GNN) by applying the redundancy reduction principle to the graph latent embeddings of augmented instances.
By sharing the pre-trained weights when fine-tuning the GNN for regression tasks, we significantly improve the performance for 7 challenging material property prediction benchmarks.
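The redundancy reduction principle can be illustrated with a Barlow Twins-style objective: standardize each embedding dimension, form the cross-correlation matrix between the two augmented views, and drive its diagonal toward 1 (invariance) and its off-diagonal toward 0 (decorrelation). A pure-Python sketch of this style of objective, not Crystal Twins' exact implementation:

```python
def redundancy_reduction_loss(z1, z2, lam=0.005):
    """Barlow Twins-style loss on two batches of embeddings (lists of
    equal-length vectors from two augmented views of the same inputs).
    """
    n, d = len(z1), len(z1[0])

    def standardize(z):
        # Zero-mean, unit-variance per embedding dimension.
        cols = []
        for j in range(d):
            col = [row[j] for row in z]
            mu = sum(col) / n
            sd = (sum((x - mu) ** 2 for x in col) / n) ** 0.5 or 1.0
            cols.append([(x - mu) / sd for x in col])
        return cols  # d lists of length n

    a, b = standardize(z1), standardize(z2)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[i][k] * b[j][k] for k in range(n)) / n
            loss += (c_ij - 1.0) ** 2 if i == j else lam * c_ij ** 2
    return loss

# Identical views with decorrelated dimensions give (near-)zero loss.
views = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss = redundancy_reduction_loss(views, views)
```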
arXiv Detail & Related papers (2022-05-04T05:08:46Z)
- Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist learning molecule representation.
Our approach is proven effective in (1) keeping the embedding space well-organized and (2) improving the generalization ability of molecule embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
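One way to encode a reaction-based constraint, sketched here under the assumption that the method relates the summed reactant embeddings to the summed product embeddings (not necessarily the paper's exact objective):

```python
def reaction_consistency_loss(reactant_embs, product_embs):
    """Squared distance between the summed reactant embeddings and the
    summed product embeddings of one reaction. Minimizing it treats a
    reaction as preserving "total" embedding content, which organizes
    the embedding space around chemical transformations.
    """
    d = len(reactant_embs[0])
    r = [sum(e[j] for e in reactant_embs) for j in range(d)]
    p = [sum(e[j] for e in product_embs) for j in range(d)]
    return sum((r[j] - p[j]) ** 2 for j in range(d))

# Made-up 2D embeddings: two reactants whose sum matches the product.
loss = reaction_consistency_loss([[1.0, 2.0], [3.0, 4.0]], [[4.0, 6.0]])
```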
arXiv Detail & Related papers (2021-09-21T00:08:43Z)
- DGL-LifeSci: An Open-Source Toolkit for Deep Learning on Graphs in Life Science [5.3825788156200565]
We present DGL-LifeSci, an open-source package for deep learning on graphs in life science.
DGL-LifeSci is a python toolkit based on RDKit, PyTorch and Deep Graph Library.
It allows GNN-based modeling on custom datasets for molecular property prediction, reaction prediction and molecule generation.
arXiv Detail & Related papers (2021-06-27T13:27:47Z)
- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models using a router method that learns to assign each input example to either type of model, which provides a substantial boost in performance.
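The router idea reduces to a learned gate that decides, per example, which model should score it. A toy sketch with illustrative names (none are from the paper):

```python
def route(example, lm_score, kg_score, gate):
    """Send each input to the LM-based or KG-embedding model according
    to a gate. Here the gate is any callable returning the probability
    that the LM model is the better scorer; a real router would be
    trained on held-out link-prediction accuracy.
    """
    use_lm = gate(example) >= 0.5
    return ("lm", lm_score(example)) if use_lm else ("kg", kg_score(example))

# Toy usage: route long textual triples to the LM, short ones to the KG model.
choice, score = route(
    "aspirin -- treats -- headache",
    lm_score=len,                      # stand-in scorers
    kg_score=lambda s: -len(s),
    gate=lambda s: 1.0 if len(s) > 20 else 0.0,
)
```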
arXiv Detail & Related papers (2021-06-17T17:55:33Z)
- A Universal Framework for Featurization of Atomistic Systems [0.0]
Reactive force fields based on physics or machine learning can be used to bridge the gap in time and length scales.
We introduce the Gaussian multi-pole (GMP) featurization scheme that utilizes physically-relevant multi-pole expansions of the electron density around atoms.
We demonstrate that GMP-based models can achieve chemical accuracy for the QM9 dataset, and their accuracy remains reasonable even when extrapolating to new elements.
arXiv Detail & Related papers (2021-02-04T03:11:00Z)
- ML4Chem: A Machine Learning Package for Chemistry and Materials Science [0.0]
ML4Chem is an open-source machine learning library for chemistry and materials science.
It provides an extendable platform to develop and deploy machine learning models and pipelines.
Here we introduce its atomistic module for the implementation, deployment, and inference of atomistic ML models.
arXiv Detail & Related papers (2020-03-02T00:28:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.