Precise Asymptotics of Bagging Regularized M-estimators
- URL: http://arxiv.org/abs/2409.15252v2
- Date: Fri, 11 Oct 2024 06:44:26 GMT
- Title: Precise Asymptotics of Bagging Regularized M-estimators
- Authors: Takuya Koriyama, Pratik Patil, Jin-Hong Du, Kai Tan, Pierre C. Bellec,
- Abstract summary: We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators.
Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples.
Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
- Score: 5.165142221427928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators and construct a consistent estimator for the risk. Specifically, we consider a heterogeneous collection of $M \ge 1$ regularized M-estimators, each trained with (possibly different) subsample sizes, convex differentiable losses, and convex regularizers. We operate under the proportional asymptotics regime, where the sample size $n$, feature size $p$, and subsample sizes $k_m$ for $m \in [M]$ all diverge with fixed limiting ratios $n/p$ and $k_m/n$. Key to our analysis is a new result on the joint asymptotic behavior of correlations between the estimator and residual errors on overlapping subsamples, governed through a (provably) contractible nonlinear system of equations. Of independent interest, we also establish convergence of trace functionals related to degrees of freedom in the non-ensemble setting (with $M = 1$) along the way, extending previously known cases for square loss and ridge, lasso regularizers. When specialized to homogeneous ensembles trained with a common loss, regularizer, and subsample size, the risk characterization sheds some light on the implicit regularization effect due to the ensemble and subsample sizes $(M,k)$. For any ensemble size $M$, optimally tuning subsample size yields sample-wise monotonic risk. For the full-ensemble estimator (when $M \to \infty$), the optimal subsample size $k^\star$ tends to be in the overparameterized regime $(k^\star \le \min\{n,p\})$, when explicit regularization is vanishing. Finally, joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data (without any subagging).
Related papers
- Demystifying SGD with Doubly Stochastic Gradients [13.033133586372612]
We establish the convergence properties of doubly SGD with independent minibatching and random reshuffling under general conditions.
We prove that random reshuffling improves the complexity dependence on the subsampling noise.
arXiv Detail & Related papers (2024-06-03T01:13:19Z) - The Sample Complexity of Gradient Descent in Stochastic Convex Optimization [14.268363583731848]
We show that the generalization error of full-batch Gradient Descent can be $tilde Theta(d/m + 1/sqrtm)$, where $d$ is the dimension and $m$ is the sample size.
This matches the sample complexity of emphworst-case empirical risk minimizers.
arXiv Detail & Related papers (2024-04-07T12:07:33Z) - Generalized equivalences between subsampling and ridge regularization [3.1346887720803505]
We prove structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators.
An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio.
arXiv Detail & Related papers (2023-05-29T14:05:51Z) - Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation [4.87717454493713]
We study subsampling-based ridge ensembles in the proportionals regime.
We prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor.
arXiv Detail & Related papers (2023-04-25T17:43:27Z) - Stochastic Approximation Approaches to Group Distributionally Robust
Optimization [96.26317627118912]
Group distributionally robust optimization (GDRO)
Online learning techniques to reduce the number of samples required in each round from $m$ to $1$, keeping the same sample.
A novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - On Stochastic Moving-Average Estimators for Non-Convex Optimization [105.22760323075008]
In this paper, we demonstrate the power of a widely used estimator based on moving average (SEMA) problems.
For all these-the-art results, we also present the results for all these-the-art problems.
arXiv Detail & Related papers (2021-04-30T08:50:24Z) - On the Subbagging Estimation for Massive Data [10.902757578215255]
This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers.
For the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each subsample with a subsample size $k_Nll N$ to meet the memory constraint is sampled uniformly without replacement.
An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate, and can be computationally fast under the memory constraint.
arXiv Detail & Related papers (2021-02-28T21:38:22Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = langle X,w* rangle + epsilon$
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Sharp Statistical Guarantees for Adversarially Robust Gaussian
Classification [54.22421582955454]
We provide the first result of the optimal minimax guarantees for the excess risk for adversarially robust classification.
Results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting.
arXiv Detail & Related papers (2020-06-29T21:06:52Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP)
We show that the number of samples needed to yield an entrywise $varepsilon$-accurate estimate of the Q-function is at most on the order of $frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.