Building Robust Machine Learning Models for Small Chemical Science Data:
The Case of Shear Viscosity
- URL: http://arxiv.org/abs/2208.10784v1
- Date: Tue, 23 Aug 2022 07:33:14 GMT
- Authors: Nikhil V. S. Avula and Shivanand K. Veesam and Sudarshan Behera and
Sundaram Balasubramanian
- Abstract summary: We train several Machine Learning models to predict the shear viscosity of a Lennard-Jones (LJ) fluid. Specifically, we investigate issues related to model selection, performance estimation, and uncertainty quantification.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Shear viscosity, though a fundamental property of all liquids, is
computationally expensive to estimate from equilibrium molecular dynamics
simulations. Recently, Machine Learning (ML) methods have been used to augment
molecular simulations in many contexts, and thus show promise for estimating
viscosity in a relatively inexpensive manner as well. However, ML methods face
significant challenges, such as overfitting, when the size of the data set is
small, as is the case with viscosity. In this work, we train several ML models
to predict the shear viscosity of a Lennard-Jones (LJ) fluid, with particular
emphasis on addressing issues arising from a small data set. Specifically, we
investigate issues related to model selection, performance estimation, and
uncertainty quantification. First, we show that the widely used procedure of
estimating performance on a single unseen data set exhibits wide variability on
small data sets. In this context, the common practice of using cross-validation
(CV) to select hyperparameters (model selection) can be adapted to estimate the
generalization error (performance estimation) as well. We compare two simple CV
procedures for their ability to do both model selection and performance
estimation, and find that the k-fold CV based procedure shows a lower variance
of error estimates. We also discuss the role of performance metrics in training
and evaluation. Finally, Gaussian Process Regression (GPR) and ensemble methods
were used to estimate the uncertainty on individual predictions. The
uncertainty estimates from GPR were further used to construct an applicability
domain, within which the ML models provided more reliable predictions on
another small data set generated in this work. Overall, the procedures
prescribed in this work, taken together, lead to robust ML models for small
data sets.
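The abstract's central recipe, using k-fold cross-validation both to select hyperparameters and to estimate generalization error, then using GPR's predictive uncertainty to define an applicability domain, can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the authors' actual pipeline; the toy "viscosity" function, the kernel choices, and the uncertainty threshold are all assumptions made for the example.

```python
# Sketch: nested k-fold CV for model selection + performance estimation,
# and a GPR-based applicability domain. Synthetic data stands in for the
# LJ-fluid viscosity set; all modeling choices here are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(60, 2))                   # e.g. (temperature, density)
y = np.exp(X[:, 0]) / X[:, 1] + rng.normal(0, 0.05, 60)   # toy "viscosity"

# Model selection: an inner k-fold CV picks the regularization strength.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(KernelRidge(kernel="rbf"),
                      {"alpha": [1e-3, 1e-2, 1e-1, 1.0]},
                      cv=inner_cv, scoring="neg_mean_squared_error")

# Performance estimation: an outer k-fold CV over the whole selection
# procedure gives a lower-variance error estimate than a single held-out set.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv,
                         scoring="neg_mean_squared_error")
print("estimated generalization MSE: %.4f +/- %.4f"
      % (-scores.mean(), scores.std()))

# Uncertainty quantification: GPR yields per-prediction standard deviations,
# which define a simple applicability domain (reject high-uncertainty inputs).
gpr = GaussianProcessRegressor(RBF() + WhiteKernel(),
                               normalize_y=True).fit(X, y)
X_new = rng.uniform(0.0, 2.5, size=(20, 2))   # partly outside the training range
mean, std = gpr.predict(X_new, return_std=True)
in_domain = std < 2.0 * std.min()             # assumed, illustrative threshold
print("predictions kept inside the applicability domain:", int(in_domain.sum()))
```

Points far from the training distribution receive large predictive standard deviations from the GPR, so thresholding on that uncertainty filters out the extrapolations where the model is least trustworthy.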
Related papers
- Model aggregation: minimizing empirical variance outperforms minimizing empirical error (arXiv, 2024-09-25)
  We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. It is non-intrusive, treating models as black-box functions; it is model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models. We show how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.
- Measuring Variable Importance in Individual Treatment Effect Estimation with High Dimensional Data (arXiv, 2024-08-23)
  Causal machine learning (ML) promises to provide powerful tools for estimating individual treatment effects. ML methods still face the significant challenge of interpretability, which is crucial for medical applications. We propose a new algorithm based on the Conditional Permutation Importance (CPI) method for statistically rigorous variable importance assessment.
- Accelerated training of deep learning surrogate models for surface displacement and flow, with application to MCMC-based history matching of CO2 storage operations (arXiv, 2024-08-20)
  We introduce a new surrogate modeling framework to predict CO2 saturation, pressure, and surface displacement for use in the history matching of carbon storage operations. Training involves a large number of inexpensive flow-only simulations combined with a much smaller number of coupled runs.
- Analytical results for uncertainty propagation through trained machine learning regression models (arXiv, 2024-04-17)
  This paper addresses the challenge of uncertainty propagation through trained/fixed machine learning (ML) regression models. We present numerical experiments in which we validate our methods and compare them with a Monte Carlo approach from a computational-efficiency point of view.
- A prediction and behavioural analysis of machine learning methods for modelling travel mode choice (arXiv, 2023-01-11)
  We conduct a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice. Results indicate that the models with the highest disaggregate predictive performance provide poorer estimates of behavioural indicators and aggregate mode shares. The MNL model performs robustly in a variety of situations, though ML techniques can improve estimates of behavioural indices such as Willingness to Pay.
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification (arXiv, 2022-07-18)
  We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods against certain adversarial attacks on the input sequences.
- Prediction of liquid fuel properties using machine learning models with Gaussian processes and probabilistic conditional generative learning (arXiv, 2021-10-18)
  The present work aims to construct cheap-to-compute machine learning (ML) models to act as closure equations for predicting the physical properties of alternative fuels. These models can be trained using databases from MD simulations and/or experimental measurements in a data-fusion-fidelity approach. The results show that the ML models can accurately predict fuel properties over a wide range of pressure and temperature conditions.
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model (arXiv, 2021-10-09)
  We aim to improve data efficiency for both classification and regression setups in deep learning. To take the best of both worlds, we propose a novel X-model, which plays a minimax game between the feature extractor and task-specific heads.
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays (arXiv, 2020-10-06)
  Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses. Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets. We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
- Good Classifiers are Abundant in the Interpolating Regime (arXiv, 2020-06-22)
  We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers. We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
- Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond (arXiv, 2019-12-30)
  We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference. Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances; we propose localized debiased machine learning (LDML), which avoids this burdensome step.
This list is automatically generated from the titles and abstracts of the papers on this site.