Failures and Successes of Cross-Validation for Early-Stopped Gradient
Descent
- URL: http://arxiv.org/abs/2402.16793v1
- Date: Mon, 26 Feb 2024 18:07:27 GMT
- Title: Failures and Successes of Cross-Validation for Early-Stopped Gradient
Descent
- Authors: Pratik Patil, Yuchen Wu, Ryan J. Tibshirani
- Abstract summary: We analyze the statistical properties of generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient descent (GD).
We prove that GCV is generically inconsistent as an estimator of the prediction risk of early-stopped GD, even for a well-specified linear model with isotropic features.
Our theory requires only mild assumptions on the data distribution and does not require the underlying regression function to be linear.
- Score: 8.0225129190882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the statistical properties of generalized cross-validation (GCV)
and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient
descent (GD) in high-dimensional least squares regression. We prove that GCV is
generically inconsistent as an estimator of the prediction risk of
early-stopped GD, even for a well-specified linear model with isotropic
features. In contrast, we show that LOOCV converges uniformly along the GD
trajectory to the prediction risk. Our theory requires only mild assumptions on
the data distribution and does not require the underlying regression function
to be linear. Furthermore, by leveraging the individual LOOCV errors, we
construct consistent estimators for the entire prediction error distribution
along the GD trajectory and consistent estimators for a wide class of error
functionals. This in particular enables the construction of pathwise prediction
intervals based on GD iterates that have asymptotically correct nominal
coverage conditional on the training data.
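To make the LOOCV-along-the-path idea concrete, below is a minimal simulation sketch, not the paper's estimator or assumptions: it reruns full-batch GD with each sample held out and compares the resulting LOOCV risk curve to the out-of-sample risk at every iterate. All names, sizes, and the data-generating model are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, steps, lr = 100, 20, 50, 0.1           # illustrative sizes and step size
beta = rng.normal(size=p) / np.sqrt(p)        # hypothetical linear signal
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

def gd_path(X, y, steps, lr):
    """Full-batch GD on (1/2n)||y - Xb||^2; returns every iterate."""
    b = np.zeros(X.shape[1])
    path = [b.copy()]
    for _ in range(steps):
        b = b - lr * X.T @ (X @ b - y) / len(y)
        path.append(b.copy())
    return path

# Naive LOOCV: rerun GD n times, each time with one sample held out,
# and predict the held-out response at every point of the GD path.
loo_pred = np.zeros((steps + 1, n))
for i in range(n):
    mask = np.arange(n) != i
    for t, b in enumerate(gd_path(X[mask], y[mask], steps, lr)):
        loo_pred[t, i] = X[i] @ b
loocv_risk = ((y - loo_pred) ** 2).mean(axis=1)   # LOOCV estimate per iterate

# Out-of-sample risk on fresh data, for comparison along the same path.
X_new = rng.normal(size=(5000, p))
y_new = X_new @ beta + rng.normal(size=5000)
test_risk = [((y_new - X_new @ b) ** 2).mean() for b in gd_path(X, y, steps, lr)]
print(np.c_[loocv_risk, test_risk][::10])         # the two curves should track
```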
Related papers
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We characterize the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high-dimensional datasets.
arXiv Detail & Related papers (2024-08-08T17:27:29Z) - Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning [5.293069542318491]
We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles.
For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally tuned by the sketch size alone in infinite ensembles.
We also propose an "ensemble trick" whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles.
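As a companion illustration, here is a hedged sketch of a sketched-ridge ensemble in its simplest sketch-and-solve form; it is not the paper's estimator or its GCV analysis, and the Gaussian sketch, sizes, and penalty are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m, K, lam = 300, 100, 40, 10, 0.1   # m = sketch size, K = ensemble size
X = rng.normal(size=(n, p))
y = X @ (rng.normal(size=p) / np.sqrt(p)) + rng.normal(size=n)

def sketched_ridge(X, y, m, lam, rng):
    """Fit ridge in an m-dimensional random Gaussian sketch of feature space."""
    S = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)   # p x m sketch matrix
    Z = X @ S                                            # sketched features
    w = np.linalg.solve(Z.T @ Z / len(y) + lam * np.eye(m), Z.T @ y / len(y))
    return S @ w                                         # lift back to p dims

# Ensemble average of K independently sketched ridge fits.
beta_hat = np.mean([sketched_ridge(X, y, m, lam, rng) for _ in range(K)], axis=0)
```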
arXiv Detail & Related papers (2023-10-06T16:27:43Z) - Corrected generalized cross-validation for finite ensembles of penalized estimators [5.165142221427927]
Generalized cross-validation (GCV) is a widely-used method for estimating the squared out-of-sample prediction risk.
We show that GCV is inconsistent for any finite ensemble of size greater than one.
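For reference, the classical GCV criterion for a single linear smoother ŷ = S y, written out here for ridge regression; the papers above study exactly when this plug-in risk estimate does or does not remain consistent. The penalty convention (lambda unscaled by n) is a choice for illustration, not the papers'.

```python
import numpy as np

def gcv_ridge(X, y, lam):
    """GCV(lam) = (1/n)||y - S y||^2 / (1 - tr(S)/n)^2 for the ridge smoother S."""
    n, p = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # smoother matrix
    resid = y - S @ y
    return (resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2
```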
arXiv Detail & Related papers (2023-10-02T17:38:54Z) - BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery [97.79015388276483]
A structural equation model (SEM) is an effective framework to reason over causal relationships represented via a directed acyclic graph (DAG).
Recent advances have enabled effective maximum-likelihood point estimation of DAGs from observational data.
We propose BCD Nets, a variational framework for estimating a distribution over DAGs characterizing a linear-Gaussian SEM.
arXiv Detail & Related papers (2021-12-06T03:35:21Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
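A compact way to see the phenomenon: the sketch below sweeps the RF width m across the interpolation threshold m = n and prints test error, which typically spikes near m = n and decreases again beyond it. It uses the closed-form minimum-norm least squares fit in place of SGD purely for brevity; widths, activation, and the data model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 30
beta = rng.normal(size=d) / np.sqrt(d)
X, X_te = rng.normal(size=(n, d)), rng.normal(size=(2000, d))
y, y_te = X @ beta + 0.3 * rng.normal(size=n), X_te @ beta

for m in [50, 100, 190, 200, 210, 400, 1000]:    # RF width sweep across m = n
    W = rng.normal(size=(d, m)) / np.sqrt(d)
    F, F_te = np.tanh(X @ W), np.tanh(X_te @ W)  # random tanh features
    a = np.linalg.lstsq(F, y, rcond=None)[0]     # minimum-norm least squares
    print(m, ((y_te - F_te @ a) ** 2).mean())    # test MSE as a function of width
```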
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z) - On Low-rank Trace Regression under General Sampling Distribution [9.699586426043885]
We show that cross-validated estimators satisfy near-optimal error bounds under general assumptions.
We also show that the cross-validated estimator outperforms the theory-inspired approach to selecting the regularization parameter.
arXiv Detail & Related papers (2019-04-18T02:56:00Z)