From Prediction to Action: Critical Role of Performance Estimation for
Machine-Learning-Driven Materials Discovery
- URL: http://arxiv.org/abs/2311.15549v2
- Date: Thu, 7 Dec 2023 02:08:13 GMT
- Title: From Prediction to Action: Critical Role of Performance Estimation for
Machine-Learning-Driven Materials Discovery
- Authors: Mario Boley and Felix Luong and Simon Teshuva and Daniel F Schmidt and
Lucas Foppa and Matthias Scheffler
- Abstract summary: We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery.
We propose a novel such estimator that, in contrast to naïve reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in a demonstrational study for double perovskites.
- Score: 2.3243389656894595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Materials discovery driven by statistical property models is an iterative
decision process, during which an initial data collection is extended with new
data proposed by a model-informed acquisition function, with the goal of
maximizing a certain "reward" over time, such as the maximum property value
discovered so far. While the materials science community achieved much progress
in developing property models that predict well on average with respect to the
training distribution, this form of in-distribution performance measurement is
not directly coupled with the discovery reward. This is because an iterative
discovery process has a shifting reward distribution that is
over-proportionally determined by the model performance for exceptional
materials. We demonstrate this problem using the example of bulk modulus
maximization among double perovskite oxides. We find that the in-distribution
predictive performance suggests random forests as superior to Gaussian process
regression, while the ranking is reversed in terms of the discovery rewards. We
argue that the lack of proper performance estimation methods from pre-computed
data collections is a fundamental problem for improving data-driven materials
discovery, and we propose a novel such estimator that, in contrast to naïve
reward estimation, successfully predicts Gaussian processes with the "expected
improvement" acquisition function as the best out of four options in our
demonstrational study for double perovskites. Importantly, it does so without
requiring the more than one thousand ab initio computations that were needed to confirm
this prediction.
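The iterative decision process described in the abstract can be illustrated with a minimal sketch. It is not the paper's estimator or its experimental setup: it only shows the standard closed-form "expected improvement" acquisition function for maximization under a Gaussian predictive distribution, used inside a pool-based loop whose reward is the maximum property value discovered so far. The names expected_improvement, toy_surrogate, and run_discovery are hypothetical, and toy_surrogate is a deliberately crude distance-based stand-in for a real property model such as Gaussian process regression.

```python
import math

import numpy as np


def expected_improvement(mean, std, best_so_far):
    """Closed-form EI for maximization with Gaussian predictions.

    EI = (mean - best) * Phi(z) + std * phi(z), with z = (mean - best) / std.
    Where std == 0 the value reduces to max(mean - best, 0).
    """
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    std = np.atleast_1d(np.asarray(std, dtype=float))
    improvement = mean - best_so_far
    ei = np.maximum(improvement, 0.0)  # value in the zero-uncertainty limit
    pos = std > 0
    z = improvement[pos] / std[pos]
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    ei[pos] = improvement[pos] * cdf + std[pos] * pdf
    return ei


def toy_surrogate(x_known, y_known, x_query):
    """Hypothetical stand-in model (NOT a Gaussian process).

    Predicts the value of the nearest known point and uses the distance
    to that point as a rough uncertainty.
    """
    dist = np.abs(x_query[:, None] - x_known[None, :])
    nearest = dist.argmin(axis=1)
    return y_known[nearest], dist.min(axis=1)


def run_discovery(pool_x, pool_y, initial_idx, n_rounds):
    """Pool-based discovery loop.

    Each round acquires the candidate with the highest EI and records the
    reward, defined here as the best property value found so far.
    """
    known = list(initial_idx)
    rewards = []
    for _ in range(n_rounds):
        candidates = [i for i in range(len(pool_x)) if i not in known]
        if not candidates:
            break
        mean, std = toy_surrogate(pool_x[known], pool_y[known], pool_x[candidates])
        best = pool_y[known].max()
        ei = expected_improvement(mean, std, best)
        known.append(candidates[int(np.argmax(ei))])
        rewards.append(float(pool_y[known].max()))
    return known, rewards


if __name__ == "__main__":
    # Synthetic one-dimensional pool; the values are illustrative only.
    x = np.linspace(0.0, 1.0, 21)
    y = -((x - 0.7) ** 2)
    _, rewards = run_discovery(x, y, initial_idx=[0, 20], n_rounds=5)
    print(rewards)
```

Because the reward depends only on the best value found, it is driven by how well the model ranks the few exceptional candidates, not by its average error over the pool. This is the decoupling between in-distribution accuracy and discovery reward that the abstract describes.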
Related papers
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - Performative Prediction with Bandit Feedback: Learning through
Reparameterization [25.169419772432796]
We develop a framework that reparametrizes the performative prediction as a function of the induced data distribution.
We provide a regret bound that is sublinear in the total number of performative samples taken and is only polynomial in the dimension of the model parameter.
On the application side, we believe our method is useful for large online recommendation systems like YouTube or TikTok.
arXiv Detail & Related papers (2023-05-01T21:31:29Z) - Functional Ensemble Distillation [18.34081591772928]
We investigate how to best distill an ensemble's predictions using an efficient model.
We find that learning the distilled model via a simple augmentation scheme in the form of mixup augmentation significantly boosts the performance.
arXiv Detail & Related papers (2022-06-05T14:07:17Z) - An Empirical Study on Distribution Shift Robustness From the Perspective
of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which the model's confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models that perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Back2Future: Leveraging Backfill Dynamics for Improving Real-time
Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task.
The 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature.
We formulate a novel problem and neural framework Back2Future that aims to refine a given model's predictions in real-time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z) - Bayesian Neural Networks for Virtual Flow Metering: An Empirical Study [0.0]
We contribute to the development of data-driven virtual flow meters by presenting a probabilistic VFM based on a Bayesian neural network.
We study the methods by modeling on a large and heterogeneous dataset, consisting of 60 wells across five different oil and gas assets.
The predictive performance is analyzed on historical and future test data, where we achieve average errors of 5-6% and 9-13%, respectively, for the 50% best-performing models.
arXiv Detail & Related papers (2021-02-02T09:05:19Z) - Goal-directed Generation of Discrete Structures with Conditional
Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z) - Gaussian Process Boosting [6.85316573653194]
We introduce a novel way to combine boosting with Gaussian process and mixed effects models.
We obtain increased prediction accuracy compared to existing approaches on simulated and real-world data sets.
arXiv Detail & Related papers (2020-04-06T13:19:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.