Synthetic data enable experiments in atomistic machine learning
- URL: http://arxiv.org/abs/2211.16443v1
- Date: Tue, 29 Nov 2022 18:17:24 GMT
- Title: Synthetic data enable experiments in atomistic machine learning
- Authors: John L. A. Gardner and Zoé Faure Beaulieu and Volker L. Deringer
- Abstract summary: We demonstrate the use of a large dataset labelled with per-atom energies from an existing ML potential model.
The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints.
We show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine-learning models are increasingly used to predict properties of atoms
in chemical systems. There have been major advances in developing descriptors
and regression frameworks for this task, typically starting from (relatively)
small sets of quantum-mechanical reference data. Larger datasets of this kind
are becoming available, but remain expensive to generate. Here we demonstrate
the use of a large dataset that we have "synthetically" labelled with per-atom
energies from an existing ML potential model. The cheapness of this process,
compared to the quantum-mechanical ground truth, allows us to generate millions
of datapoints, in turn enabling rapid experimentation with atomistic ML models
from the small- to the large-data regime. This approach allows us here to
compare regression frameworks in depth, and to explore visualisation based on
learned representations. We also show that learning synthetic data labels can
be a useful pre-training task for subsequent fine-tuning on small datasets. In
the future, we expect that our open-sourced dataset, and similar ones, will be
useful in rapidly exploring deep-learning models in the limit of abundant
chemical data.
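The pre-train-then-fine-tune recipe described above can be sketched end to end. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the `synthetic_energy` and `reference_energy` functions below are hypothetical toy labellers standing in for an existing ML potential and the quantum-mechanical ground truth, and a scikit-learn MLP with `warm_start` stands in for an atomistic model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a cheap "synthetic" labeller (mimicking an
# existing ML potential) and an expensive "ground-truth" labeller
# (mimicking a quantum-mechanical calculation).
def synthetic_energy(x):
    return np.sin(x[:, 0]) + 0.10 * x[:, 1]

def reference_energy(x):
    return np.sin(x[:, 0]) + 0.12 * x[:, 1] + 0.05

# Synthetic labels are cheap, so the pre-training set can be large ...
X_big = rng.uniform(-3, 3, size=(5_000, 2))
y_big = synthetic_energy(X_big)

# ... while only a small ground-truth-labelled set is available.
X_small = rng.uniform(-3, 3, size=(200, 2))
y_small = reference_energy(X_small)

# Pre-train on the synthetic labels, then continue training
# (fine-tune) on the small reference set via warm_start.
model = MLPRegressor(hidden_layer_sizes=(64, 64), warm_start=True,
                     max_iter=100, random_state=0)
model.fit(X_big, y_big)       # pre-training task
model.max_iter = 50
model.fit(X_small, y_small)   # fine-tuning on the small dataset

X_test = rng.uniform(-3, 3, size=(500, 2))
mse = np.mean((model.predict(X_test) - reference_energy(X_test)) ** 2)
```

The key design point mirrored here is that the pre-trained weights, not random ones, initialise the fine-tuning stage; in practice the paper applies this idea to atomistic models with per-atom energy labels.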
Related papers
- Transfer Learning for Molecular Property Predictions from Small Data Sets [0.0]
We benchmark common machine learning models for the prediction of molecular properties on two small data sets.
We present a transfer learning strategy that uses large data sets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original data sets.
arXiv Detail & Related papers (2024-04-20T14:25:34Z) - Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z) - Synthetic pre-training for neural-network interatomic potentials [0.0]
We show that synthetic atomistic data, themselves obtained at scale with an existing machine learning potential, constitute a useful pre-training task for neural-network interatomic potential models.
Once pre-trained with a large synthetic dataset, these models can be fine-tuned on a much smaller, quantum-mechanical one, improving numerical accuracy and stability in computational practice.
arXiv Detail & Related papers (2023-07-24T17:16:24Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS [0.0]
We perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets.
We analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets.
We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments.
arXiv Detail & Related papers (2022-12-03T08:19:06Z) - Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Bridge Data Center AI Systems with Edge Computing for Actionable Information Retrieval [0.5652468989804973]
High data rates at modern synchrotron and X-ray free-electron lasers motivate the use of machine learning methods for data reduction, feature detection, and other purposes.
We describe here how specialized data center AI systems can be used for this purpose.
arXiv Detail & Related papers (2021-05-28T16:47:01Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z) - Forecasting Industrial Aging Processes with Machine Learning Methods [0.0]
We evaluate a wider range of data-driven models, comparing some traditional stateless models to more complex recurrent neural networks.
Our results show that recurrent models produce near-perfect predictions when trained on larger datasets.
arXiv Detail & Related papers (2020-02-05T13:06:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.