On Data Imbalance in Molecular Property Prediction with Pre-training
- URL: http://arxiv.org/abs/2308.08934v1
- Date: Thu, 17 Aug 2023 12:04:14 GMT
- Title: On Data Imbalance in Molecular Property Prediction with Pre-training
- Authors: Limin Wang, Masatoshi Hanai, Toyotaro Suzumura, Shun Takashige,
Kenjiro Taura
- Abstract summary: A technique called pre-training is used to improve the accuracy of machine learning models.
Pre-training involves training the model on a pretext task, which is different from the target task, before training the model on the target task.
In this study, we propose an effective pre-training method that addresses the imbalance in input data.
- Score: 16.211138511816642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Revealing and analyzing the various properties of materials is an essential
and critical issue in the development of materials, including batteries,
semiconductors, catalysts, and pharmaceuticals. Traditionally, these properties
have been determined through theoretical calculations and simulations. However,
it is not practical to perform such calculations on every single candidate
material. Recently, an approach that combines theoretical calculation and
machine learning has emerged: machine learning models are trained on
a subset of theoretical calculation results to construct a surrogate model that
can be applied to the remaining materials. Separately, a technique
called pre-training is used to improve the accuracy of machine learning models.
Pre-training involves training the model on a pretext task, which is different
from the target task, before training the model on the target task. This
process aims to extract the input data features, stabilizing the learning
process and improving its accuracy. However, in the case of molecular property
prediction, there is a strong imbalance in the distribution of input data and
features, which may lead to biased learning towards frequently occurring data
during pre-training. In this study, we propose an effective pre-training method
that addresses the imbalance in input data. We aim to improve the final
accuracy by modifying the loss function of the existing representative
pre-training method, node masking, to compensate for the imbalance. We have
investigated and assessed the impact of our proposed imbalance compensation on
pre-training and the final prediction accuracy through experiments and
evaluations using a benchmark of molecular property prediction models.
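The abstract does not state the exact form of the modified loss. As a rough illustration only, the sketch below assumes one common way to compensate for class imbalance in a node-masking objective: weighting the cross-entropy over masked atom types by inverse class frequency. The function names (`inverse_frequency_weights`, `weighted_cross_entropy`) and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight class c by N / (K_present * count_c).

    Rare classes receive larger weights; classes absent from `labels` get 0.
    """
    counts = Counter(labels)
    total = len(labels)
    present = len(counts)
    return [
        total / (present * counts[c]) if counts[c] > 0 else 0.0
        for c in range(num_classes)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-node negative log-likelihoods."""
    num = 0.0
    den = 0.0
    for logits, y in zip(logits_batch, labels):
        p = softmax(logits)[y]
        num += -weights[y] * math.log(p)
        den += weights[y]
    return num / den


# Toy masked-node labels (hypothetical): class 0 dominates, class 2 is rare.
masked_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = inverse_frequency_weights(masked_labels, num_classes=3)

# A model that predicts uniformly over the three classes.
uniform_logits = [[0.0, 0.0, 0.0] for _ in masked_labels]
loss = weighted_cross_entropy(uniform_logits, masked_labels, weights)
```

With these weights each class contributes equally to the total loss regardless of how often it appears, so a frequently occurring atom type no longer dominates the pre-training gradient. Whether the paper uses this weighting or a different compensation is not stated in the abstract.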
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- Machine learning for accuracy in density functional approximations [0.0]
Recent progress in applying machine learning to improve the accuracy of density functional approximations is reviewed.
Promises and challenges in devising machine learning models transferable between different chemistries and materials classes are discussed.
arXiv Detail & Related papers (2023-11-01T00:02:09Z)
- Task-Aware Machine Unlearning and Its Application in Load Forecasting [4.00606516946677]
This paper introduces the concept of machine unlearning which is specifically designed to remove the influence of part of the dataset on an already trained forecaster.
A performance-aware algorithm is proposed by evaluating the sensitivity of local model parameter change using influence function and sample re-weighting.
We tested the unlearning algorithms on linear, CNN, and Mixer based load forecasters with a realistic load dataset.
arXiv Detail & Related papers (2023-08-28T08:50:12Z)
- Is Self-Supervised Pretraining Good for Extrapolation in Molecular Property Prediction? [16.211138511816642]
In material science, the prediction of unobserved values, commonly referred to as extrapolation, is critical for property prediction.
We propose an experimental framework for the demonstration and empirically reveal that while models were unable to accurately extrapolate absolute property values, self-supervised pretraining enables them to learn relative tendencies of unobserved property values.
arXiv Detail & Related papers (2023-08-16T03:38:43Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- On the Relation between Prediction and Imputation Accuracy under Missing Covariates [0.0]
Recent research has realized an increasing trend towards the usage of modern Machine Learning algorithms for imputation.
arXiv Detail & Related papers (2021-12-09T23:30:44Z)
- Predictive machine learning for prescriptive applications: a coupled training-validating approach [77.34726150561087]
We propose a new method for training predictive machine learning models for prescriptive applications.
This approach is based on tweaking the validation step in the standard training-validating-testing scheme.
Several experiments with synthetic data demonstrate promising results in reducing the prescription costs in both deterministic and real models.
arXiv Detail & Related papers (2021-10-22T15:03:20Z)
- Hessian-based toolbox for reliable and interpretable machine learning in physics [58.720142291102135]
We present a toolbox for interpretability and reliability, extrapolation of the model architecture.
It provides a notion of the influence of the input data on the prediction at a given test point, an estimation of the uncertainty of the model predictions, and an agnostic score for the model predictions.
Our work opens the road to the systematic use of interpretability and reliability methods in ML applied to physics and, more generally, science.
arXiv Detail & Related papers (2021-08-04T16:32:59Z)
- Calibrated Uncertainty for Molecular Property Prediction using Ensembles of Message Passing Neural Networks [11.47132155400871]
We extend a message passing neural network designed specifically for predicting properties of molecules and materials.
We show that our approach results in accurate models for predicting molecular formation energies with calibrated uncertainty.
arXiv Detail & Related papers (2021-07-13T13:28:11Z)
- Precise Tradeoffs in Adversarial Training for Linear Regression [55.764306209771405]
We provide a precise and comprehensive understanding of the role of adversarial training in the context of linear regression with Gaussian features.
We precisely characterize the standard/robust accuracy and the corresponding tradeoff achieved by a contemporary mini-max adversarial training approach.
Our theory for adversarial training algorithms also facilitates the rigorous study of how a variety of factors (size and quality of training data, model overparametrization etc.) affect the tradeoff between these two competing accuracies.
arXiv Detail & Related papers (2020-02-24T19:01:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.