Precise Asymptotics of Bagging Regularized M-estimators
- URL: http://arxiv.org/abs/2409.15252v2
- Date: Fri, 11 Oct 2024 06:44:26 GMT
- Title: Precise Asymptotics of Bagging Regularized M-estimators
- Authors: Takuya Koriyama, Pratik Patil, Jin-Hong Du, Kai Tan, Pierre C. Bellec
- Abstract summary: We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators.
Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples.
Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
- Score: 5.165142221427928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators and construct a consistent estimator for the risk. Specifically, we consider a heterogeneous collection of $M \ge 1$ regularized M-estimators, each trained with (possibly different) subsample sizes, convex differentiable losses, and convex regularizers. We operate under the proportional asymptotics regime, where the sample size $n$, feature size $p$, and subsample sizes $k_m$ for $m \in [M]$ all diverge with fixed limiting ratios $n/p$ and $k_m/n$. Key to our analysis is a new result on the joint asymptotic behavior of correlations between the estimator and residual errors on overlapping subsamples, governed through a (provably) contractible nonlinear system of equations. Of independent interest, we also establish convergence of trace functionals related to degrees of freedom in the non-ensemble setting (with $M = 1$) along the way, extending previously known cases for square loss and ridge, lasso regularizers. When specialized to homogeneous ensembles trained with a common loss, regularizer, and subsample size, the risk characterization sheds some light on the implicit regularization effect due to the ensemble and subsample sizes $(M,k)$. For any ensemble size $M$, optimally tuning subsample size yields sample-wise monotonic risk. For the full-ensemble estimator (when $M \to \infty$), the optimal subsample size $k^\star$ tends to be in the overparameterized regime $(k^\star \le \min\{n,p\})$, when explicit regularization is vanishing. Finally, joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data (without any subagging).
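To make the setup concrete, here is a minimal sketch of subagging with ridge regression standing in for the regularized M-estimator (one instance of the convex loss/regularizer pairs covered by the paper): fit $M$ estimators on random size-$k$ subsamples drawn without replacement and average them. The sizes, tuning values, and the closed-form ridge fit below are illustrative assumptions, not the paper's estimators or its risk characterization.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on one subsample; stands in for a generic
    regularized M-estimator (convex loss + convex regularizer)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def subagged_estimator(X, y, k, M, lam, rng):
    """Subagging: average M estimators, each trained on a random
    subsample of size k drawn uniformly without replacement."""
    n = X.shape[0]
    fits = [ridge_fit(X[idx], y[idx], lam)
            for idx in (rng.choice(n, size=k, replace=False) for _ in range(M))]
    return np.mean(fits, axis=0)

# Toy experiment in the proportional regime: n, p, k of comparable size.
rng = np.random.default_rng(0)
n, p, k, M, lam = 600, 300, 200, 20, 1e-2
beta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta_star + rng.normal(size=n)

beta_bag = subagged_estimator(X, y, k, M, lam, rng)
# Squared prediction risk for an isotropic test point x0 ~ N(0, I_p):
# E[(x0' beta_bag - x0' beta_star)^2] = ||beta_bag - beta_star||^2.
print(f"k={k}, M={M}: risk ≈ {np.sum((beta_bag - beta_star) ** 2):.3f}")
```

Varying $(M, k, \lambda)$ in this sketch mimics the joint tuning of ensemble size, subsample size, and regularization discussed in the abstract.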
Related papers
- Demystifying SGD with Doubly Stochastic Gradients [13.033133586372612]
We establish the convergence properties of doubly SGD with independent minibatching and random reshuffling under general conditions.
We prove that random reshuffling improves the complexity dependence on the subsampling noise.
arXiv Detail & Related papers (2024-06-03T01:13:19Z) - The Sample Complexity of Gradient Descent in Stochastic Convex Optimization [14.268363583731848]
We show that the generalization error of full-batch Gradient Descent can be $\tilde{\Theta}(d/m + 1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size.
This matches the sample complexity of worst-case empirical risk minimizers.
arXiv Detail & Related papers (2024-04-07T12:07:33Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Generalized equivalences between subsampling and ridge regularization [3.1346887720803505]
We prove structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators.
An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio.
arXiv Detail & Related papers (2023-05-29T14:05:51Z) - Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation [4.87717454493713]
We study subsampling-based ridge ensembles in the proportional asymptotics regime.
We prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor.
arXiv Detail & Related papers (2023-04-25T17:43:27Z) - Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - On the Subbagging Estimation for Massive Data [10.902757578215255]
This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers.
From the whole dataset of size $N$, $m_N$ subsamples are randomly drawn; each subsample, of size $k_N \ll N$ to meet the memory constraint, is sampled uniformly without replacement, as in the sketch below.
An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate, and can be computationally fast under the memory constraint.
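As a rough illustration of that procedure (not the article's implementation), the sketch below draws $m_N$ subsamples of size $k_N \ll N$ uniformly without replacement from a memory-mapped dataset, fits least squares on each, and averages the estimates so that only $k_N$ rows are loaded per fit. File names, sizes, and the use of `numpy` memory mapping are hypothetical choices for the example.

```python
import numpy as np

N, p, m_N, k_N = 100_000, 8, 50, 2_000
rng = np.random.default_rng(0)

# Simulate a dataset on disk as a stand-in for data too large for memory,
# then reopen it memory-mapped so each fit only touches k_N rows.
X_full = rng.normal(size=(N, p))
y_full = X_full @ np.arange(1.0, p + 1) + rng.normal(size=N)
np.save("X_big.npy", X_full)
np.save("y_big.npy", y_full)
X = np.load("X_big.npy", mmap_mode="r")
y = np.load("y_big.npy", mmap_mode="r")

estimates = []
for _ in range(m_N):
    idx = rng.choice(N, size=k_N, replace=False)     # uniform, w/o replacement
    Xs, ys = np.asarray(X[idx]), np.asarray(y[idx])  # only k_N rows in memory
    estimates.append(np.linalg.lstsq(Xs, ys, rcond=None)[0])

theta_subbag = np.mean(estimates, axis=0)            # subbagging estimate
print(np.round(theta_subbag, 2))                     # ≈ [1, 2, ..., p]
```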
arXiv Detail & Related papers (2021-02-28T21:38:22Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Sharp Statistical Guarantees for Adversarially Robust Gaussian
Classification [54.22421582955454]
We provide the first optimal minimax guarantees on the excess risk for adversarially robust classification.
Results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting.
arXiv Detail & Related papers (2020-06-29T21:06:52Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)