Related papers: Mathematical Theory of Collinearity Effects on Machine Learning Variable Importance Measures

URL: http://arxiv.org/abs/2510.00557v1
Date: Wed, 01 Oct 2025 06:18:57 GMT
Title: Mathematical Theory of Collinearity Effects on Machine Learning Variable Importance Measures
Authors: Kelvyn K. Bladen, D. Richard Cutler, Alan Wisler
Abstract summary: Two approaches are Permute-and-Predict (PaP), which randomly permutes a feature in a validation set, and Leave-One-Covariate-Out (LOCO), which retrains models after permuting a training feature. This work bridges empirical evidence and theory, enhancing the interpretability and application of variable importance measures.
Score: 0.45880283710344066
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In many machine learning problems, understanding variable importance is a central concern. Two common approaches are Permute-and-Predict (PaP), which randomly permutes a feature in a validation set, and Leave-One-Covariate-Out (LOCO), which retrains models after permuting a training feature. Both methods deem a variable important if predictions with the original data substantially outperform those with permutations. In linear regression, empirical studies have linked PaP to regression coefficients and LOCO to $t$-statistics, but a formal theory has been lacking. We derive closed-form expressions for both measures, expressed using square-root transformations. PaP is shown to be proportional to the coefficient and predictor variability: $\text{PaP}_i = \beta_i \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$, while LOCO is proportional to the coefficient but dampened by collinearity (captured by $\Delta$): $\text{LOCO}_i = \beta_i (1 -\Delta)\sqrt{1 + c}$. These derivations explain why PaP is largely unaffected by multicollinearity, whereas LOCO is highly sensitive to it. Monte Carlo simulations confirm these findings across varying levels of collinearity. Although derived for linear regression, we also show that these results provide reasonable approximations for models like Random Forests. Overall, this work establishes a theoretical basis for two widely used importance measures, helping analysts understand how they are affected by the true coefficients, dimension, and covariance structure. This work bridges empirical evidence and theory, enhancing the interpretability and application of variable importance measures.
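The PaP expression in the abstract can be checked numerically with a small simulation. The sketch below is not the authors' code. It assumes a two-predictor Gaussian linear model with unit noise variance, takes the PaP measure to be the square root of the increase in validation mean squared error after permuting one feature, and compares it with $|\beta_i| \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$. The function name, sample size, and coefficients are illustrative choices, and the absolute value is an assumption about the sign convention.

```python
import numpy as np


def pap_sqrt_importance(rho, beta=(2.0, 1.0), n=20000, seed=0):
    """Return (empirical, theoretical) square-root PaP importance of feature 0.

    Assumed setup (not taken from the paper): two standard-normal predictors
    with correlation rho, and y = X @ beta + standard-normal noise.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    beta = np.asarray(beta, dtype=float)

    def draw(size):
        X = rng.multivariate_normal(np.zeros(2), cov, size=size)
        y = X @ beta + rng.normal(size=size)
        return X, y

    X_train, y_train = draw(n)
    X_val, y_val = draw(n)

    # Ordinary least squares with an intercept, fitted on the training set.
    design = np.column_stack([np.ones(n), X_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)

    def val_mse(X):
        pred = coef[0] + X @ coef[1:]
        return np.mean((y_val - pred) ** 2)

    # Permute feature 0 in the validation set only; the model is not refit.
    X_perm = X_val.copy()
    X_perm[:, 0] = rng.permutation(X_val[:, 0])

    empirical = np.sqrt(val_mse(X_perm) - val_mse(X_val))
    theoretical = abs(beta[0]) * np.sqrt(2.0 * X_val[:, 0].var())
    return empirical, theoretical


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        emp, theo = pap_sqrt_importance(rho)
        print(f"rho={rho:.1f}  empirical={emp:.3f}  theoretical={theo:.3f}")
```

Under these assumptions the two values should stay close to each other at every correlation level, which is consistent with the abstract's claim that PaP is largely unaffected by multicollinearity. The LOCO expression is not simulated here because the abstract does not define $\Delta$ or $c$.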

Related papers

Disentangled Feature Importance [0.0]
We introduce Disentangled Feature Importance (DFI), a nonparametric generalization of the classical $R^2$ decomposition via optimal transport. DFI transforms correlated features into independent latent variables using a transport map, eliminating correlation distortion. DFI provides a principled decomposition of importance scores that sum to the total predictive variability for latent additive models.
arXiv Detail & Related papers (2025-06-30T20:54:48Z)
Multiply-Robust Causal Change Attribution [15.501106533308798]
We develop a new estimation strategy that combines regression and re-weighting methods to quantify the contribution of each causal mechanism. Our method demonstrates excellent performance in Monte Carlo simulations, and we show its usefulness in an empirical application.
arXiv Detail & Related papers (2024-04-12T22:57:01Z)
TIC-TAC: A Framework for Improved Covariance Estimation in Deep Heteroscedastic Regression [109.69084997173196]
Deep heteroscedastic regression involves jointly optimizing the mean and covariance of the predicted distribution using the negative log-likelihood. Recent works show that this may result in sub-optimal convergence due to the challenges associated with covariance estimation. We study two questions: (1) Does the predicted covariance truly capture the randomness of the predicted mean? Our results show that not only does TIC accurately learn the covariance, it additionally facilitates an improved convergence of the negative log-likelihood.
arXiv Detail & Related papers (2023-10-29T09:54:03Z)
Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression [53.15502562048627]
Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator. This work delves into a statistical analysis of augmentation-based pretraining.
arXiv Detail & Related papers (2023-06-01T15:18:55Z)
Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data [1.6099403809839032]
The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. Code is provided as an open-source package in R.
arXiv Detail & Related papers (2023-01-17T21:50:35Z)
Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z)
On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data. Invariance measures consistency of model predictions on transformations of the data. From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets.
arXiv Detail & Related papers (2022-07-14T17:08:25Z)
Decorrelated Variable Importance [0.0]
We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models.
arXiv Detail & Related papers (2021-11-21T16:31:36Z)
Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
Quantile regression tends to under-cover relative to the desired coverage level in reality. We prove that quantile regression suffers from an inherent under-coverage bias. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
arXiv Detail & Related papers (2021-06-10T06:11:55Z)
Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests [87.60900567941428]
A 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter. In machine learning, these have a know-it-when-you-see-it character. We study stress testing using the tools of causal inference.
arXiv Detail & Related papers (2021-05-31T14:39:38Z)
What causes the test error? Going beyond bias-variance via ANOVA [21.359033212191218]
Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. Recent work aimed to understand in greater depth why overparametrization is helpful for generalization. We propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way.
arXiv Detail & Related papers (2020-10-11T05:21:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
