Machine Learning Force Fields with Data Cost Aware Training
- URL: http://arxiv.org/abs/2306.03109v1
- Date: Mon, 5 Jun 2023 04:34:54 GMT
- Title: Machine Learning Force Fields with Data Cost Aware Training
- Authors: Alexander Bukharin, Tianyi Liu, Shengjie Wang, Simiao Zuo, Weihao Gao,
Wen Yan, Tuo Zhao
- Abstract summary: Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation.
Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels.
We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
- Score: 94.78998399180519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning force fields (MLFF) have been proposed to accelerate
molecular dynamics (MD) simulation, which finds widespread applications in
chemistry and biomedical research. Even for the most data-efficient MLFFs,
reaching chemical accuracy can require hundreds of frames of force and energy
labels generated by expensive quantum mechanical algorithms, which may scale as
$O(n^3)$ to $O(n^7)$, with $n$ proportional to the number of basis functions.
To address this issue, we propose a multi-stage computational framework --
ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of
cheap inaccurate data and expensive accurate data. The motivation behind
ASTEROID is that inaccurate data, though incurring large bias, can help capture
the sophisticated structures of the underlying force field. Therefore, we first
train an MLFF model on a large amount of inaccurate training data, employing a
bias-aware loss function to prevent the model from overfitting the potential
bias of this data. We then fine-tune the obtained model using a small amount of
accurate training data, which preserves the knowledge learned from the
inaccurate training data while significantly improving the model's accuracy.
Moreover, we propose a variant of ASTEROID based on score matching for the
setting where the inaccurate training data are unlabeled. Extensive experiments
on MD datasets and downstream tasks validate the efficacy of ASTEROID. Our code
and data are available at https://github.com/abukharin3/asteroid.
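Below is a minimal, PyTorch-style sketch of the two-stage recipe described in the abstract, added only for orientation. It is not the authors' implementation (see the linked repository for that): the model, data loaders, per-frame weights, and hyperparameters are hypothetical placeholders, and a simple weighted squared error stands in for the paper's bias-aware loss.

```python
# Hypothetical sketch of the two-stage training recipe (not the ASTEROID code).
import torch
import torch.nn as nn


def pretrain_on_inaccurate(model, loader, epochs=10, lr=1e-3):
    """Stage 1: learn the rough shape of the force field from cheap, biased labels.

    Each batch carries a per-frame weight that down-weights frames believed to
    have large label bias; this stands in for the paper's bias-aware loss.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for coords, forces, weight in loader:        # cheap labels, e.g. low-level theory
            err = (model(coords) - forces) ** 2
            per_frame = err.flatten(1).mean(dim=1)   # one scalar error per frame
            loss = (weight * per_frame).mean()       # bias-aware down-weighting
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


def finetune_on_accurate(model, loader, epochs=50, lr=1e-4):
    """Stage 2: refine the pre-trained model on a few accurate, expensive frames."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for coords, forces in loader:                # accurate quantum-mechanical labels
            loss = mse(model(coords), forces)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

In this reading, the second stage starts from the stage-one weights, so the structure learned from cheap data is preserved while the small accurate set corrects its bias.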
Related papers
- An Investigation on Machine Learning Predictive Accuracy Improvement and Uncertainty Reduction using VAE-based Data Augmentation [2.517043342442487]
Deep generative learning uses certain ML models to learn the underlying distribution of existing data and generate synthetic samples that resemble the real data.
In this study, our objective is to evaluate the effectiveness of data augmentation using variational autoencoder (VAE)-based deep generative models.
We investigated whether the data augmentation leads to improved accuracy in the predictions of a deep neural network (DNN) model trained using the augmented data.
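As a rough illustration of the augmentation recipe summarized above (a sketch under assumed settings, not the cited paper's code), one can fit a small VAE to the real samples and decode draws from the prior into synthetic training points; all architectures, dimensions, and hyperparameters below are hypothetical.

```python
# Hypothetical VAE-based data augmentation sketch (not the cited paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallVAE(nn.Module):
    def __init__(self, d_in, d_latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(),
                                 nn.Linear(32, 2 * d_latent))   # outputs [mu, logvar]
        self.dec = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(),
                                 nn.Linear(32, d_in))
        self.d_latent = d_latent

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization
        return self.dec(z), mu, logvar


def augment(real_x, n_synthetic=1000, epochs=200, lr=1e-3):
    """Fit a VAE to `real_x` (n_samples x d_in) and append decoded prior samples."""
    vae = SmallVAE(d_in=real_x.shape[1])
    opt = torch.optim.Adam(vae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = vae(real_x)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        loss = F.mse_loss(recon, real_x) + kl
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                        # sample from the prior, then decode
        synthetic = vae.dec(torch.randn(n_synthetic, vae.d_latent))
    return torch.cat([real_x, synthetic], dim=0)
```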
arXiv Detail & Related papers (2024-10-24T18:15:48Z)
- When More Data Hurts: Optimizing Data Coverage While Mitigating Diversity Induced Underfitting in an Ultra-Fast Machine-Learned Potential [0.0]
This study investigates how training data diversity affects the performance of machine-learned interatomic potentials (MLIPs).
We employ expert and autonomously generated data to create the training data and fit four force-field variants to subsets of the data.
Our findings reveal a critical balance in training data diversity: insufficient diversity hinders generalization, while excessive diversity can exceed the MLIP's learning capacity.
arXiv Detail & Related papers (2024-09-11T20:45:44Z)
- Transfer Learning for Molecular Property Predictions from Small Data Sets [0.0]
We benchmark common machine learning models for the prediction of molecular properties on two small data sets.
We present a transfer learning strategy that uses large data sets to pre-train the respective models, yielding more accurate models after fine-tuning on the original data sets.
arXiv Detail & Related papers (2024-04-20T14:25:34Z)
- AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
- Leveraging variational autoencoders for multiple data imputation [0.5156484100374059]
We investigate the ability of deep models, namely variational autoencoders (VAEs), to account for uncertainty in missing data through multiple imputation strategies.
We find that VAEs provide poor empirical coverage of missing data, with underestimation and overconfident imputations.
To overcome this, we employ $\beta$-VAEs, which, viewed from a generalized Bayes framework, provide robustness to model misspecification.
arXiv Detail & Related papers (2022-09-30T08:58:43Z)
- Too Fine or Too Coarse? The Goldilocks Composition of Data Complexity for Robust Left-Right Eye-Tracking Classifiers [0.0]
We train machine learning models using a mixed dataset composed of both fine- and coarse-grain data.
For our purposes, finer-grain data refers to data collected using more complex methods, whereas coarser-grain data refers to data collected using simpler methods.
arXiv Detail & Related papers (2022-08-24T23:18:08Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks [51.143054943431665]
We propose Hypergradient Data Relevance Analysis, or HYDRA, which interprets predictions made by deep neural networks (DNNs) as effects of their training data.
HYDRA assesses the contribution of training data toward test data points throughout the training trajectory.
In addition, we quantitatively demonstrate that HYDRA outperforms influence functions in accurately estimating data contribution and detecting noisy data labels.
arXiv Detail & Related papers (2021-02-04T10:00:13Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)