Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization
- URL: http://arxiv.org/abs/2502.01347v1
- Date: Mon, 03 Feb 2025 13:38:42 GMT
- Title: Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization
- Authors: Simone Bombari, Marco Mondelli
- Abstract summary: Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data.
We quantify the amount of spurious correlations $C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization.
- Score: 19.261178173399784
- License:
- Abstract: Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications on robustness, bias and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive core feature $x$ and a spurious feature $y$. Specifically, we quantify the amount of spurious correlations $C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $C$ and the in-distribution test loss $L$, by showing that the value of $\lambda$ that minimizes $L$ lies in an interval where $C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
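The setting in the abstract can be illustrated with a small simulation. Below is a minimal sketch (not the authors' code): labels depend only on a core feature $x$, while a correlated but non-predictive spurious feature $y$ is concatenated to the input, and ridge regression is fit for several values of $\lambda$. The dimensions, the correlation level `rho`, the noise level, and the use of the weight norm on the spurious block as a proxy for $C$ are illustrative assumptions, not the paper's exact definitions.

```python
# Minimal sketch: spurious correlations learned by ridge regression as lambda varies.
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_y, rho, sigma = 200, 50, 50, 0.8, 0.1       # illustrative choices

theta_star = rng.standard_normal(d_x) / np.sqrt(d_x)   # ground-truth weights on the core feature

def sample(m):
    x = rng.standard_normal((m, d_x))
    # spurious feature: correlated with the core feature but carries no extra label signal
    y = rho * x[:, :d_y] + np.sqrt(1 - rho**2) * rng.standard_normal((m, d_y))
    z = np.concatenate([x, y], axis=1)                  # full input seen by the model
    labels = x @ theta_star + sigma * rng.standard_normal(m)   # labels depend on x only
    return z, labels

Z_tr, t_tr = sample(n)
Z_te, t_te = sample(2000)

d = d_x + d_y
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    w = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(d), Z_tr.T @ t_tr)   # ridge solution
    C = np.linalg.norm(w[d_x:])                 # proxy for the spurious correlation C
    L = np.mean((Z_te @ w - t_te) ** 2)         # in-distribution test loss L
    print(f"lambda = {lam:8.3f}   C = {C:.3f}   L = {L:.3f}")
```

Sweeping $\lambda$ in this toy setup makes the trade-off described in the abstract visible: the value of $\lambda$ minimizing the test loss $L$ need not be the one minimizing the spurious weight $C$.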
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Efficient Truncated Linear Regression with Unknown Noise Variance [26.870279729431328]
We provide the first computationally and statistically efficient estimators for truncated linear regression when the noise variance is unknown.
Our estimator is based on an efficient implementation of Projected Gradient Descent on the negative-likelihood of the truncated sample.
arXiv Detail & Related papers (2022-08-25T12:17:37Z) - Observable adjustments in single-index models for regularized
M-estimators [3.5353632767823506]
In the regime where the sample size $n$ and dimension $p$ are both increasing, the behavior of the empirical distribution of $\hat\beta$ and the predicted values $X\hat\beta$ has been previously characterized.
This paper develops a different theory to describe the empirical distribution of $\hat\beta$ and $X\hat\beta$.
arXiv Detail & Related papers (2022-04-14T14:32:02Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
Quantile regression tends to under-cover relative to the desired coverage level in practice.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
arXiv Detail & Related papers (2021-06-10T06:11:55Z) - SLOE: A Faster Method for Statistical Inference in High-Dimensional
Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z) - Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive the regret upper bounds on the classes of Sobolev spaces $W_p^\beta(\mathcal{X})$, $p \geq 2$, $\beta > \frac{d}{p}$.
The upper bounds are supported by the minimax regret analysis, which reveals that in the cases $\beta > \frac{d}{2}$ or $p = \infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z) - Query Complexity of Least Absolute Deviation Regression via Robust
Uniform Convergence [26.51921319084656]
We show that the query complexity of least absolute deviation regression is $\Theta(d/\epsilon^2)$ up to logarithmic factors.
We introduce the notion of robust uniform convergence, which is a new approximation guarantee for the empirical loss.
arXiv Detail & Related papers (2021-02-03T22:54:27Z) - Computationally and Statistically Efficient Truncated Regression [36.3677715543994]
We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression.
Our estimator uses Projected Stochastic Gradient Descent (PSGD) without replacement on the negative log-likelihood of the truncated sample (a minimal sketch of this scheme appears after the list below).
As a corollary, we show that SGD learns the parameters of single-layer neural networks with noisy activation functions.
arXiv Detail & Related papers (2020-10-22T19:31:30Z) - A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian
Kernel, a Precise Phase Transition, and the Corresponding Double Descent [85.77233010209368]
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples $n$, their dimension $p$, and the number of random features $N$ are all large and comparable.
This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$.
arXiv Detail & Related papers (2020-06-09T02:05:40Z) - A Precise High-Dimensional Asymptotic Theory for Boosting and
Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional theory for boosting on separable data.
Under a class of statistical models, we provide an exact analysis of the generalization error of boosting.
We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)