Generalized equivalences between subsampling and ridge regularization
- URL: http://arxiv.org/abs/2305.18496v2
- Date: Wed, 18 Oct 2023 02:14:06 GMT
- Title: Generalized equivalences between subsampling and ridge regularization
- Authors: Pratik Patil and Jin-Hong Du
- Abstract summary: We prove structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators.
An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio.
- Score: 3.1346887720803505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We establish precise structural and risk equivalences between subsampling and
ridge regularization for ensemble ridge estimators. Specifically, we prove that
linear and quadratic functionals of subsample ridge estimators, when fitted
with different ridge regularization levels $\lambda$ and subsample aspect
ratios $\psi$, are asymptotically equivalent along specific paths in the
$(\lambda,\psi)$-plane (where $\psi$ is the ratio of the feature dimension to
the subsample size). Our results only require bounded moment assumptions on
feature and response distributions and allow for arbitrary joint distributions.
Furthermore, we provide a data-dependent method to determine the equivalent
paths of $(\lambda,\psi)$. An indirect implication of our equivalences is that
optimally tuned ridge regression exhibits a monotonic prediction risk in the
data aspect ratio. This resolves a recent open problem raised by Nakkiran et
al. for general data distributions under proportional asymptotics, assuming a
mild regularity condition that maintains regression hardness through linearized
signal-to-noise ratios.
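As a rough numerical illustration of the kind of equivalence the abstract describes (a simulation sketch under assumed Gaussian data, not the paper's method), the snippet below fits an ensemble of subsample ridge estimators at a given pair $(\lambda,\psi)$ and then searches a grid of full-data ridge penalties for the one whose prediction risk matches; the paper instead gives a data-dependent rule for determining the equivalent path. The dimensions, the penalty $\lambda$, the ensemble size, and the helper `ridge` are all illustrative choices.
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n samples, p features, subsample aspect ratio psi = p / k.
n, p = 2000, 400
psi = 0.8
k = int(p / psi)        # subsample size (here 500)
lam = 0.1               # ridge penalty used on each subsample
M = 100                 # ensemble size

# Simple Gaussian linear model (the paper allows far more general distributions).
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=0.5, size=n)

def ridge(X_, y_, l):
    """Ridge estimator solving (X'X/m + l*I) b = X'y/m for a sample of size m."""
    m = X_.shape[0]
    return np.linalg.solve(X_.T @ X_ / m + l * np.eye(X_.shape[1]), X_.T @ y_ / m)

# Ensemble of ridge estimators fitted on M random subsamples of size k.
fits = []
for _ in range(M):
    idx = rng.choice(n, size=k, replace=False)
    fits.append(ridge(X[idx], y[idx], lam))
beta_ens = np.mean(fits, axis=0)

# Out-of-sample prediction risk estimated on fresh data.
X_test = rng.normal(size=(10000, p))
y_test = X_test @ beta + rng.normal(scale=0.5, size=10000)

def test_risk(b):
    return np.mean((y_test - X_test @ b) ** 2)

# Naive grid search for the full-data ridge penalty whose risk matches the ensemble;
# the paper's data-dependent path would replace this brute-force matching step.
lam_grid = np.geomspace(1e-3, 10, 60)
full_risks = np.array([test_risk(ridge(X, y, l)) for l in lam_grid])
j = int(np.argmin(np.abs(full_risks - test_risk(beta_ens))))

print(f"subsample ensemble (lam={lam}, psi={psi}) risk: {test_risk(beta_ens):.4f}")
print(f"matched full-data ridge: lam* ~ {lam_grid[j]:.3f}, risk {full_risks[j]:.4f}")
```
On a run like this, the matched full-data penalty $\lambda^*$ should come out larger than the per-subsample $\lambda$, consistent with subsampling acting as additional implicit ridge regularization.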
Related papers
- Precise Asymptotics of Bagging Regularized M-estimators [5.165142221427928]
We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators.
Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples.
Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
arXiv Detail & Related papers (2024-09-23T17:48:28Z) - Ensemble linear interpolators: The role of ensembling [5.135730286836428]
Interpolators are unstable; for example, the minimum $\ell_2$-norm least squares interpolator exhibits unbounded test errors when dealing with noisy data.
We study how ensembling stabilizes and thus improves the performance, measured by the out-of-sample prediction risk, of an individual interpolator.
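For context on the objects involved, here is a minimal sketch (a standard construction, not this paper's analysis): the minimum $\ell_2$-norm least squares interpolator computed via the pseudoinverse on random subsamples, compared as a single fit versus a subsample average. The dimensions, noise level, and helper `min_norm_interpolator` are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, M = 600, 300, 200, 50   # subsamples of size k < p, so each fit interpolates

beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=1.0, size=n)

def min_norm_interpolator(X_, y_):
    """Minimum l2-norm least squares solution X^+ y (interpolates when k <= p)."""
    return np.linalg.pinv(X_) @ y_

# One interpolator on a single subsample versus an average over M subsamples.
idx0 = rng.choice(n, size=k, replace=False)
b_single = min_norm_interpolator(X[idx0], y[idx0])

fits = []
for _ in range(M):
    idx = rng.choice(n, size=k, replace=False)
    fits.append(min_norm_interpolator(X[idx], y[idx]))
b_ens = np.mean(fits, axis=0)

X_test = rng.normal(size=(10000, p))
y_test = X_test @ beta + rng.normal(scale=1.0, size=10000)
for name, b in [("single interpolator", b_single), ("subsample ensemble ", b_ens)]:
    print(name, "test MSE:", np.mean((y_test - X_test @ b) ** 2))
```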
arXiv Detail & Related papers (2023-09-06T20:38:04Z) - Adaptive Annealed Importance Sampling with Constant Rate Progress [68.8204255655161]
Annealed Importance Sampling (AIS) synthesizes weighted samples from an intractable distribution.
We propose the Constant Rate AIS algorithm and its efficient implementation for $\alpha$-divergences.
arXiv Detail & Related papers (2023-06-27T08:15:28Z) - Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation [4.87717454493713]
We study subsampling-based ridge ensembles in the proportional asymptotics regime.
We prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor.
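A crude way to see this claim numerically (an illustrative simulation, not the paper's proof technique): sweep the subsample size for a ridgeless, minimum-norm ensemble and sweep the penalty for full-data ridge, then compare the two optimal test risks; with a finite ensemble size and moderate dimensions the match is only approximate. All sizes and grids below are assumptions.
```python
import numpy as np

rng = np.random.default_rng(5)
n, p, M = 800, 200, 50           # illustrative sizes; M approximates the "full" ensemble

beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=0.5, size=n)
X_test = rng.normal(size=(5000, p))
y_test = X_test @ beta + rng.normal(scale=0.5, size=5000)

def test_risk(b):
    return np.mean((y_test - X_test @ b) ** 2)

def ridgeless(X_, y_):
    """Minimum-norm least squares fit, i.e. the ridge estimator as the penalty -> 0+."""
    return np.linalg.pinv(X_) @ y_

# Test risk of the ridgeless ensemble as a function of the subsample size k.
ens_risks = {}
for k in [50, 100, 150, 250, 400, 600, 800]:
    fits = []
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=False)
        fits.append(ridgeless(X[idx], y[idx]))
    ens_risks[k] = test_risk(np.mean(fits, axis=0))

# Test risk of the full-data ridge predictor as a function of the penalty.
lam_grid = np.geomspace(1e-3, 10, 40)
ridge_risks = [test_risk(np.linalg.solve(X.T @ X / n + l * np.eye(p), X.T @ y / n))
               for l in lam_grid]

best_k = min(ens_risks, key=ens_risks.get)
print(f"best ridgeless-ensemble risk {ens_risks[best_k]:.4f} at subsample size k={best_k}")
print(f"best full-data ridge risk    {min(ridge_risks):.4f}")
```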
arXiv Detail & Related papers (2023-04-25T17:43:27Z) - Semi-parametric inference based on adaptively collected data [34.56133468275712]
We construct suitably weighted estimating equations that account for adaptivity in data collection.
Our results characterize the degree of "explorability" required for normality to hold.
We illustrate our general theory with concrete consequences for various problems, including standard linear bandits and sparse generalized bandits.
arXiv Detail & Related papers (2023-03-05T00:45:32Z) - Kernel-based off-policy estimation without overlap: Instance optimality
beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
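A generic sketch of the clipping idea in a streaming setting, here applied to heavy-tailed mean estimation rather than the paper's exact setting; the Student-t stream, step-size schedule, and clipping radius are illustrative choices, not the paper's algorithm.
```python
import numpy as np

rng = np.random.default_rng(2)
p, T = 20, 50000                 # dimension and number of streaming samples
mu = np.ones(p)                  # mean to be estimated
clip_radius = 5.0                # illustrative clipping threshold

theta_clip = np.zeros(p)         # clipped-SGD iterate
theta_plain = np.zeros(p)        # unclipped iterate, for comparison

for t in range(1, T + 1):
    # Heavy-tailed sample: Student-t noise around mu in each coordinate.
    x = mu + rng.standard_t(df=2.5, size=p)
    step = 1.0 / t               # simple decaying step size (illustrative choice)

    # Stochastic gradient of the loss 0.5 * ||theta - x||^2 is (theta - x).
    g = theta_clip - x
    norm = np.linalg.norm(g)
    if norm > clip_radius:       # clip the gradient norm to damp heavy-tailed noise
        g *= clip_radius / norm
    theta_clip -= step * g
    theta_plain -= step * (theta_plain - x)

print("error with clipping   :", np.linalg.norm(theta_clip - mu))
print("error without clipping:", np.linalg.norm(theta_plain - mu))
```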
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
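For intuition about the algorithm under discussion, a minimal sketch of constant-stepsize SGD with tail averaging on a linear regression problem; the step size, burn-in fraction, and problem sizes are ad hoc choices, not taken from the paper.
```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10000, 50
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=0.5, size=n)

step = 0.01                      # constant step size (ad hoc choice)
w = np.zeros(p)
tail_sum = np.zeros(p)
burn_in = n // 2                 # average only the second half of the iterates

for t in range(n):               # single pass: one sample per update
    x_t, y_t = X[t], y[t]
    w -= step * (x_t @ w - y_t) * x_t      # stochastic gradient of 0.5*(x'w - y)^2
    if t >= burn_in:
        tail_sum += w

w_avg = tail_sum / (n - burn_in)            # tail-averaged SGD iterate
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("||w_avg - beta|| =", np.linalg.norm(w_avg - beta))
print("||w_ols - beta|| =", np.linalg.norm(w_ols - beta))
```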
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Tight Nonparametric Convergence Rates for Stochastic Gradient Descent
under the Noiseless Linear Model [0.0]
We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-squares risk under this model.
As a special case, we analyze an online algorithm for estimating a real function on the unit interval from the noiseless observation of its value at randomly sampled points.
arXiv Detail & Related papers (2020-06-15T08:25:50Z)
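As a toy version of the special case mentioned in the last entry (estimating a function on the unit interval from noiseless values at random points), here is a single-pass, fixed step-size SGD sketch over an assumed trigonometric feature map; the target function, basis size, and step size are illustrative choices, not the paper's setup.
```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    """Target function on the unit interval (illustrative choice)."""
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

def features(x, m=20):
    """Simple trigonometric feature map on [0, 1]."""
    k = np.arange(1, m + 1)
    return np.concatenate(([1.0], np.sin(2 * np.pi * k * x), np.cos(2 * np.pi * k * x)))

dim = features(0.0).shape[0]
w = np.zeros(dim)
step = 0.05                       # fixed step size (illustrative)
T = 20000                         # number of noiseless observations, single pass

for _ in range(T):
    x = rng.random()              # uniformly sampled point on [0, 1]
    phi = features(x)
    w -= step * (phi @ w - f(x)) * phi   # SGD step on the noiseless square loss

# Evaluate the estimate on a grid.
grid = np.linspace(0, 1, 200)
est = np.array([features(x) @ w for x in grid])
print("max abs error on grid:", np.max(np.abs(est - f(grid))))
```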
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown here and is not responsible for any consequences of its use.