Relating the Partial Dependence Plot and Permutation Feature Importance
to the Data Generating Process
- URL: http://arxiv.org/abs/2109.01433v1
- Date: Fri, 3 Sep 2021 10:50:41 GMT
- Authors: Christoph Molnar, Timo Freiesleben, Gunnar König, Giuseppe
Casalicchio, Marvin N. Wright, Bernd Bischl
- Abstract summary: Partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods.
We formalize PD and PFI as statistical estimators of ground truth estimands rooted in the data generating process.
We show that PD and PFI estimates deviate from this ground truth due to statistical biases, model variance and Monte Carlo approximation errors.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientists and practitioners increasingly rely on machine learning to model
data and draw conclusions. Compared to statistical modeling approaches, machine
learning makes fewer explicit assumptions about data structures, such as
linearity. However, the parameters of such models usually cannot be easily related to
the data generating process. To learn about the modeled relationships, partial
dependence (PD) plots and permutation feature importance (PFI) are often used
as interpretation methods. However, PD and PFI lack a theory that relates them
to the data generating process. We formalize PD and PFI as statistical
estimators of ground truth estimands rooted in the data generating process. We
show that PD and PFI estimates deviate from this ground truth due to
statistical biases, model variance and Monte Carlo approximation errors. To
account for model variance in PD and PFI estimation, we propose the learner-PD
and the learner-PFI based on model refits, and propose corrected variance and
confidence interval estimators.
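The estimators the abstract describes can be made concrete in a short sketch: a Monte Carlo PD curve, a permutation-based PFI, and a "learner-PD" that averages PD curves over model refits so the spread across refits reflects model variance. The linear model, the simulated data generating process, and all settings below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares linear model; returns a predict function."""
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda Xnew: np.c_[np.ones(len(Xnew)), Xnew] @ beta

def partial_dependence(predict, X, j, grid):
    """PD(g) = mean over i of predict(x_j = g, x_{-j} = X[i])."""
    pd = []
    for g in grid:
        Xg = X.copy()
        Xg[:, j] = g
        pd.append(predict(Xg).mean())
    return np.array(pd)

def permutation_importance(predict, X, y, j, rng):
    """PFI = MSE after permuting feature j minus the original MSE."""
    base = np.mean((y - predict(X)) ** 2)
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((y - predict(Xp)) ** 2) - base

# Simulated data generating process: y = 2*x0 + noise; x1 is irrelevant.
n = 500
X = rng.normal(size=(n, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=n)

predict = fit_linear(X, y)
grid = np.linspace(-2, 2, 5)
pd_curve = partial_dependence(predict, X, 0, grid)
pfi_x0 = permutation_importance(predict, X, y, 0, rng)
pfi_x1 = permutation_importance(predict, X, y, 1, rng)

# Learner-PD: refit on bootstrap resamples and average the PD curves;
# the standard deviation across refits captures model variance.
curves = []
for _ in range(20):
    idx = rng.integers(0, n, n)
    p = fit_linear(X[idx], y[idx])
    curves.append(partial_dependence(p, X, 0, grid))
curves = np.array(curves)
learner_pd = curves.mean(axis=0)
learner_pd_sd = curves.std(axis=0, ddof=1)
```

On this simulated process, PFI for the informative feature dominates the irrelevant one, and the PD curve recovers the linear relationship; the paper's corrected variance estimators would go beyond the naive `learner_pd_sd` shown here.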
Related papers
- Influence Functions for Scalable Data Attribution in Diffusion Models [52.92223039302037]
Diffusion models have led to significant advancements in generative modelling.
Yet their widespread adoption poses challenges regarding data attribution and interpretability.
In this paper, we aim to help address such challenges by developing an influence-functions framework.
arXiv Detail & Related papers (2024-10-17T17:59:02Z)
- Quantifying Distribution Shifts and Uncertainties for Enhanced Model Robustness in Machine Learning Applications [0.0]
This study explores model adaptation and generalization by utilizing synthetic data.
We employ quantitative measures such as Kullback-Leibler divergence, Jensen-Shannon distance, and Mahalanobis distance to assess data similarity.
Our findings suggest that utilizing statistical measures, such as the Mahalanobis distance, to determine whether model predictions fall within the low-error "interpolation regime" or the high-error "extrapolation regime" provides a complementary method for assessing distribution shift and model uncertainty.
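The interpolation/extrapolation check described above can be sketched with a Mahalanobis distance against the training distribution; the empirical threshold and the toy data below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 3))  # toy "training distribution"

mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def mahalanobis(x):
    """Mahalanobis distance of x from the training mean."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Empirical cutoff: the 99th percentile of distances on the training
# data itself separates the two regimes.
train_d = np.array([mahalanobis(x) for x in train])
threshold = np.quantile(train_d, 0.99)

inlier = np.zeros(3)        # near the training mean
outlier = np.full(3, 8.0)   # far outside the training support

in_regime = mahalanobis(inlier) <= threshold    # "interpolation"
out_regime = mahalanobis(outlier) > threshold   # "extrapolation"
```

Predictions for points flagged as extrapolation would then be treated as higher-uncertainty, in the spirit of the study's complementary shift measure.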
arXiv Detail & Related papers (2024-05-03T10:05:31Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
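A minimal self-consuming loop with kernel density estimation can make the error-propagation point tangible: fit a Gaussian KDE, sample synthetic data from it, refit on the samples, and repeat. With a fixed bandwidth h, each generation inflates the variance by roughly h^2. This is a toy instance under illustrative settings, not the paper's framework.

```python
import numpy as np

rng = np.random.default_rng(2)

def kde_sample(data, h, n, rng):
    """Sample from a Gaussian KDE: pick a data point, add N(0, h^2) noise."""
    centers = rng.choice(data, size=n)
    return centers + rng.normal(scale=h, size=n)

h = 0.5
data = rng.normal(size=5000)  # generation 0: real data, variance ~ 1
variances = [data.var()]
for _ in range(5):            # each generation trains on the previous one
    data = kde_sample(data, h, 5000, rng)
    variances.append(data.var())
# variances drifts upward by about h**2 = 0.25 per generation
```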
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Diffusion models for probabilistic programming [56.47577824219207]
Diffusion Model Variational Inference (DMVI) is a novel method for automated approximate inference in probabilistic programming languages (PPLs).
DMVI is easy to implement, allows hassle-free inference in PPLs without the drawbacks of, e.g., variational inference using normalizing flows, and places no constraints on the underlying neural network model.
arXiv Detail & Related papers (2023-11-01T12:17:05Z)
- Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language for describing how training data influences predictions, through a causal framework.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z)
- On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data.
Invariance measures consistency of model predictions on transformations of the data.
From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets.
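The invariance measure described above can be sketched as the fraction of inputs whose predicted class survives a transformation, computed alongside accuracy on the same test set. The toy classifier, labels, and noise transformations below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def classify(X):
    """Toy classifier: predict class 1 when the feature sum is positive."""
    return (X.sum(axis=1) > 0).astype(int)

def invariance(X, transform):
    """Fraction of predictions unchanged under the transformation."""
    return float(np.mean(classify(X) == classify(transform(X))))

X = rng.normal(size=(2000, 4))
y = (X.sum(axis=1) > 0).astype(int)  # labels match the decision rule

accuracy = float(np.mean(classify(X) == y))
# Small perturbations leave most predictions intact; large ones do not.
inv_small = invariance(X, lambda Z: Z + rng.normal(scale=0.05, size=Z.shape))
inv_large = invariance(X, lambda Z: Z + rng.normal(scale=2.0, size=Z.shape))
```

Repeating such measurements across test sets of varying difficulty is what would expose the linear accuracy-invariance relationship the paper reports.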
arXiv Detail & Related papers (2022-07-14T17:08:25Z)
- Optimal regularizations for data generation with probabilistic graphical models [0.0]
Empirically, well-chosen regularization schemes dramatically improve the quality of the inferred models.
We consider the particular case of L2 and L1 regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models.
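As a minimal stand-in for the paper's pairwise-graphical-model setting, an L2 penalty in MAP estimation corresponds to a Gaussian prior on the parameters, which in the linear-Gaussian case has the closed-form ridge solution below. The data and regularization strengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0] = 3.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def map_l2(X, y, lam):
    """MAP under a N(0, 1/lam) prior: solve (X'X + lam I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_weak = map_l2(X, y, 0.1)       # light regularization, near least squares
beta_strong = map_l2(X, y, 1000.0)  # heavy regularization shrinks toward 0
```

Choosing `lam` well is exactly the "well-chosen regularization scheme" question the abstract raises; the L1 case has no closed form and needs an iterative solver.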
arXiv Detail & Related papers (2021-12-02T14:45:16Z)
- Variational Gibbs Inference for Statistical Model Estimation from Incomplete Data [7.4250022679087495]
We introduce variational Gibbs inference (VGI), a new general-purpose method to estimate the parameters of statistical models from incomplete data.
We validate VGI on a set of synthetic and real-world estimation tasks, estimating important machine learning models such as variational autoencoders and normalising flows from incomplete data.
arXiv Detail & Related papers (2021-11-25T17:22:22Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)