Calibration and generalizability of probabilistic models on low-data
chemical datasets with DIONYSUS
- URL: http://arxiv.org/abs/2212.01574v2
- Date: Tue, 6 Dec 2022 18:37:51 GMT
- Title: Calibration and generalizability of probabilistic models on low-data
chemical datasets with DIONYSUS
- Authors: Gary Tom, Riley J. Hickman, Aniket Zinzuwadia, Afshan Mohajeri,
Benjamin Sanchez-Lengeling, Alan Aspuru-Guzik
- Abstract summary: We perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets.
We analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets.
We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models that leverage large datasets are often the state of the
art for modelling molecular properties. When the datasets are smaller (< 2000
molecules), it is not clear that deep learning approaches are the right
modelling tool. In this work we perform an extensive study of the calibration
and generalizability of probabilistic machine learning models on small chemical
datasets. Using different molecular representations and models, we analyse the
quality of their predictions and uncertainties in a variety of tasks (binary,
regression) and datasets. We also introduce two simulated experiments that
evaluate their performance: (1) Bayesian optimization guided molecular design,
(2) inference on out-of-distribution data via ablated cluster splits. We offer
practical insights into model and feature choice for modelling small chemical
datasets, a common scenario in new chemical experiments. We have packaged our
analysis into the DIONYSUS repository, which is open sourced to aid in
reproducibility and extension to new datasets.
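As an illustration of what "calibration" means for a probabilistic regression model, the following minimal sketch compares nominal confidence levels against the observed fraction of targets that fall inside the corresponding Gaussian predictive intervals. It is not taken from the DIONYSUS repository: the function names `calibration_curve` and `mean_calibration_gap`, the synthetic data, and the Gaussian predictive assumption are all hypothetical choices made for this example.

```python
from statistics import NormalDist
import random


def calibration_curve(y_true, mu, sigma, levels):
    """For each nominal level p, return the observed fraction of targets
    inside the central p-interval of the predicted N(mu, sigma^2)."""
    std_normal = NormalDist()
    observed = []
    for p in levels:
        # Half-width of the central p-interval, in units of sigma.
        z = std_normal.inv_cdf(0.5 + p / 2.0)
        inside = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mu, sigma))
        observed.append(inside / len(y_true))
    return observed


def mean_calibration_gap(levels, observed):
    """Mean absolute difference between nominal and observed coverage
    (0 would indicate perfect calibration on these levels)."""
    return sum(abs(p - o) for p, o in zip(levels, observed)) / len(levels)


# Synthetic stand-in for a small dataset (hypothetical, not chemical data):
# predictions mu, with true targets scattered around them with std 0.5.
rng = random.Random(0)
n = 2000
mu = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y = [m + rng.gauss(0.0, 0.5) for m in mu]
levels = [i / 10 for i in range(1, 10)]

# A model reporting the correct uncertainty vs. an overconfident one
# that reports a predictive std five times too small.
well = calibration_curve(y, mu, [0.5] * n, levels)
over = calibration_curve(y, mu, [0.1] * n, levels)

print("gap, correct sigma:      ", round(mean_calibration_gap(levels, well), 3))
print("gap, overconfident sigma:", round(mean_calibration_gap(levels, over), 3))
```

A well-calibrated model gives observed coverage close to the nominal level at every `p`, so its gap is near zero, while the overconfident model covers far fewer targets than it claims and shows a much larger gap.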
Related papers
- A survey of probabilistic generative frameworks for molecular simulations [0.0]
Generative artificial intelligence is now a widely used tool in molecular science.
We introduce and explain several classes of generative models, broadly sorted into two categories: flow-based models and diffusion models.
We examine their accuracy, computational cost, and generation speed across datasets with tunable dimensionality, complexity, and modal asymmetry.
arXiv Detail & Related papers (2024-11-14T12:05:08Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study [27.56700461408765]
Key elements underlying molecular property prediction remain largely unexplored.
We conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets.
In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4,200 models on SMILES sequences and 8,400 models on molecular graphs.
arXiv Detail & Related papers (2022-09-26T14:07:59Z)
- Molecular Attributes Transfer from Non-Parallel Data [57.010952598634944]
We formulate molecular optimization as a style transfer problem and present a novel generative model that could automatically learn internal differences between two groups of non-parallel data.
Experiments on two molecular optimization tasks, toxicity modification and synthesizability improvement, demonstrate that our model significantly outperforms several state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T06:10:22Z)
- Size doesn't matter: predicting physico- or biochemical properties based on dozens of molecules [0.0]
The paper shows a significant improvement in the performance of models for target properties with a lack of data.
The effects of the dataset composition on model quality and the applicability domain of the resulting models are also considered.
arXiv Detail & Related papers (2021-07-22T18:57:24Z)
- Model-agnostic multi-objective approach for the evolutionary discovery of mathematical models [55.41644538483948]
In modern data science, it is often more interesting to understand the properties of a model and which of its parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z)
- Kernel-Based Models for Influence Maximization on Graphs based on Gaussian Process Variance Minimization [9.357483974291899]
We introduce and investigate a novel model for influence maximization (IM) on graphs.
Data-driven approaches can be applied to determine proper kernels for this IM model.
Compared to models in this field that rely on costly Monte-Carlo simulations, our model allows for a simple and cost-efficient update strategy.
arXiv Detail & Related papers (2021-03-02T08:55:34Z)
- Learning Neural Generative Dynamics for Molecular Conformation Generation [89.03173504444415]
We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph.
We propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph.
arXiv Detail & Related papers (2021-02-20T03:17:58Z)
- Predicting Chemical Properties using Self-Attention Multi-task Learning based on SMILES Representation [0.0]
In this study, we explore the structural differences of transformer-variant models and propose a new self-attention-based model.
The representation learning performance of the self-attention module was evaluated in a multi-task learning environment using imbalanced chemical datasets.
arXiv Detail & Related papers (2020-10-19T09:46:50Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.