Building Robust Machine Learning Models for Small Chemical Science Data:
The Case of Shear Viscosity
- URL: http://arxiv.org/abs/2208.10784v1
- Date: Tue, 23 Aug 2022 07:33:14 GMT
- Authors: Nikhil V. S. Avula and Shivanand K. Veesam and Sudarshan Behera and
Sundaram Balasubramanian
- Abstract summary: We train several Machine Learning models to predict the shear viscosity of a Lennard-Jones (LJ) fluid. Specifically, we investigate issues related to model selection, performance estimation, and uncertainty quantification.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Shear viscosity, though a fundamental property of all liquids, is
computationally expensive to estimate from equilibrium molecular dynamics
simulations. Recently, Machine Learning (ML) methods have been used to augment
molecular simulations in many contexts, and thus show promise for estimating
viscosity in a relatively inexpensive manner as well. However, ML methods face
significant challenges, such as overfitting, when the size of the data set is
small, as is the case with viscosity. In this work, we train several ML models
to predict the shear viscosity of a Lennard-Jones (LJ) fluid, with particular
emphasis on addressing issues arising from a small data set. Specifically, we
investigate issues related to model selection, performance estimation, and
uncertainty quantification. First, we show that the widely used procedure of
estimating performance on a single unseen data set exhibits wide variability on
small data sets. In this context, the common practice of using cross-validation
(CV) to select hyperparameters (model selection) can be adapted to estimate the
generalization error (performance estimation) as well. We compare two simple CV
procedures for their ability to do both model selection and performance
estimation, and find that the k-fold CV based procedure shows a lower variance
of error estimates. We also discuss the role of performance metrics in training
and evaluation. Finally, Gaussian Process Regression (GPR) and ensemble methods
were used to estimate the uncertainty on individual predictions. The
uncertainty estimates from GPR were further used to construct an applicability
domain, within which the ML models provided more reliable predictions on
another small data set generated in this work. Overall, the procedures
prescribed in this work, taken together, lead to robust ML models for small
data sets.
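The abstract's central recipe, using k-fold cross-validation both to select hyperparameters and to estimate generalization error, then using GPR's predictive uncertainty to define an applicability domain, can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the authors' actual pipeline; the toy "viscosity" function, the kernel choices, and the uncertainty threshold are all assumptions made for the example.

```python
# Sketch: nested k-fold CV for model selection + performance estimation,
# and a GPR-based applicability domain. Synthetic data stands in for the
# LJ-fluid viscosity set; all modeling choices here are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(60, 2))                   # e.g. (temperature, density)
y = np.exp(X[:, 0]) / X[:, 1] + rng.normal(0, 0.05, 60)   # toy "viscosity"

# Model selection: an inner k-fold CV picks the regularization strength.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(KernelRidge(kernel="rbf"),
                      {"alpha": [1e-3, 1e-2, 1e-1, 1.0]},
                      cv=inner_cv, scoring="neg_mean_squared_error")

# Performance estimation: an outer k-fold CV over the whole selection
# procedure gives a lower-variance error estimate than a single held-out set.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv,
                         scoring="neg_mean_squared_error")
print("estimated generalization MSE: %.4f +/- %.4f"
      % (-scores.mean(), scores.std()))

# Uncertainty quantification: GPR yields per-prediction standard deviations,
# which define a simple applicability domain (reject high-uncertainty inputs).
gpr = GaussianProcessRegressor(RBF() + WhiteKernel(),
                               normalize_y=True).fit(X, y)
X_new = rng.uniform(0.0, 2.5, size=(20, 2))   # partly outside the training range
mean, std = gpr.predict(X_new, return_std=True)
in_domain = std < 2.0 * std.min()             # assumed, illustrative threshold
print("predictions kept inside the applicability domain:", int(in_domain.sum()))
```

Points far from the training distribution receive large predictive standard deviations from the GPR, so thresholding on that uncertainty filters out the extrapolations where the model is least trustworthy.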
Related papers
- Model aggregation: minimizing empirical variance outperforms minimizing empirical error (arXiv, 2024-09-25)
  We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. It is non-intrusive, treating models as black-box functions; it is model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models. We show how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.
- Measuring Variable Importance in Individual Treatment Effect Estimation with High Dimensional Data (arXiv, 2024-08-23)
  Causal machine learning (ML) promises to provide powerful tools for estimating individual treatment effects. ML methods still face the significant challenge of interpretability, which is crucial for medical applications. We propose a new algorithm based on the Conditional Permutation Importance (CPI) method for statistically rigorous variable importance assessment.
- Accelerated training of deep learning surrogate models for surface displacement and flow, with application to MCMC-based history matching of CO2 storage operations (arXiv, 2024-08-20)
  We introduce a new surrogate modeling framework to predict CO2 saturation, pressure, and surface displacement for use in the history matching of carbon storage operations. Training involves a large number of inexpensive flow-only simulations combined with a much smaller number of coupled runs.
- Analytical results for uncertainty propagation through trained machine learning regression models (arXiv, 2024-04-17)
  This paper addresses the challenge of uncertainty propagation through trained/fixed machine learning (ML) regression models. We present numerical experiments in which we validate our methods and compare them with a Monte Carlo approach from a computational-efficiency point of view.
- A prediction and behavioural analysis of machine learning methods for modelling travel mode choice (arXiv, 2023-01-11)
  We conduct a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice. Results indicate that the models with the highest disaggregate predictive performance provide poorer estimates of behavioural indicators and aggregate mode shares. The MNL model performs robustly in a variety of situations, though ML techniques can improve estimates of behavioural indices such as Willingness to Pay.
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification (arXiv, 2022-07-18)
  We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods against certain adversarial attacks on the input sequences.
- Prediction of liquid fuel properties using machine learning models with Gaussian processes and probabilistic conditional generative learning (arXiv, 2021-10-18)
  The present work aims to construct cheap-to-compute machine learning (ML) models to act as closure equations for predicting the physical properties of alternative fuels. These models can be trained using databases from MD simulations and/or experimental measurements in a data-fusion-fidelity approach. The results show that the ML models can accurately predict fuel properties over a wide range of pressure and temperature conditions.
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model (arXiv, 2021-10-09)
  We aim to improve data efficiency for both classification and regression setups in deep learning. To take the best of both worlds, we propose a novel X-model, which plays a minimax game between the feature extractor and task-specific heads.
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays (arXiv, 2020-10-06)
  Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses. Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets. We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
- Good Classifiers are Abundant in the Interpolating Regime (arXiv, 2020-06-22)
  We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers. We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
- Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond (arXiv, 2019-12-30)
  We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference. Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances; we propose localized debiased machine learning (LDML), which avoids this burdensome step.
This list is automatically generated from the titles and abstracts of the papers on this site.