Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD
- URL: http://arxiv.org/abs/2410.20135v1
- Date: Sat, 26 Oct 2024 10:14:17 GMT
- Title: Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD
- Authors: Aniket Das, Dheeraj Nagaraj, Soumyabrata Pal, Arun Suggala, Prateek Varshney,
- Abstract summary: We show that Clipped-SGD, for smooth and strongly convex objectives, achieves an error of $sqrtfracmathsfTr(Sigma)+sqrtmathsfTr(Sigma)+sqrtmathsfTr(Sigma)+sqrtmathsfTr(Sigma)+sqrtmathsfTr(Sigma)+sqrtmathsfTr(Sigma)+sqrtmathsf
- Score: 16.019880089338383
- License:
- Abstract: We consider the problem of high-dimensional heavy-tailed statistical estimation in the streaming setting, which is much harder than the traditional batch setting due to memory constraints. We cast this problem as stochastic convex optimization with heavy tailed stochastic gradients, and prove that the widely used Clipped-SGD algorithm attains near-optimal sub-Gaussian statistical rates whenever the second moment of the stochastic gradient noise is finite. More precisely, with $T$ samples, we show that Clipped-SGD, for smooth and strongly convex objectives, achieves an error of $\sqrt{\frac{\mathsf{Tr}(\Sigma)+\sqrt{\mathsf{Tr}(\Sigma)\|\Sigma\|_2}\log(\frac{\log(T)}{\delta})}{T}}$ with probability $1-\delta$, where $\Sigma$ is the covariance of the clipped gradient. Note that the fluctuations (depending on $\frac{1}{\delta}$) are of lower order than the term $\mathsf{Tr}(\Sigma)$. This improves upon the current best rate of $\sqrt{\frac{\mathsf{Tr}(\Sigma)\log(\frac{1}{\delta})}{T}}$ for Clipped-SGD, known only for smooth and strongly convex objectives. Our results also extend to smooth convex and lipschitz convex objectives. Key to our result is a novel iterative refinement strategy for martingale concentration, improving upon the PAC-Bayes approach of Catoni and Giulini.
Related papers
- Mirror Descent Algorithms with Nearly Dimension-Independent Rates for
Differentially-Private Stochastic Saddle-Point Problems [6.431793114484429]
We propose $sqrtlog(d)/sqrtn + log(d)/[nvarepsilon]2/5$ to solve the problem of differentially-private saddle-points in the polyhedral setting.
We show that our algorithms attain a rate of $sqrtlog(d)/sqrtn + log(d)/[nvarepsilon]2/5$ with constant success.
arXiv Detail & Related papers (2024-03-05T12:28:00Z) - (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum [7.095058159492494]
heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over iterations of gradient descent.
We show that SHB can attain an accelerated acceleration when the mini-size is larger than a threshold $b* that that on the number $kappa.
arXiv Detail & Related papers (2024-01-12T18:17:28Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - High Probability Convergence of Stochastic Gradient Methods [15.829413808059124]
We show convergence with bounds depending on the initial distance to the optimal solution.
We demonstrate that our techniques can be used to obtain high bound for AdaGrad-Norm.
arXiv Detail & Related papers (2023-02-28T18:42:11Z) - Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization [40.19608203064051]
We investigate theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model.
For strongly convex and smooth functions, we establish an $mathcalO(sigma_max2 + Sigma_max2) log, better than their $mathcalO(sigma_max2 + Sigma_max2) log T.
We establish the first dynamic regret guarantee for the model with convex and smooth functions, which is more favorable than static regret bounds in non-stationary scenarios
arXiv Detail & Related papers (2023-02-09T10:42:11Z) - Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $min_mathbfxmax_mathbfyF(mathbfx) + H(mathbfx,mathbfy)$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present a emphaccelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragrad
arXiv Detail & Related papers (2022-06-17T06:10:20Z) - Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high probability generalization bound of order $O(frac1n + fracL2n2sum_t=1T(gamma_t/varepsilon_t)2)$ for gradient Langevin Dynamics (GLD)
We can also obtain new bounds for certain variants of SGD.
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least square estimator for the unknown transitional dynamic.
This also improves upon the best-known algorithms in this setting when $sigma_k2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider optimization with delayed gradients where, at each time stept$, the algorithm makes an update using a stale computation - d_t$ for arbitrary delay $d_t gradient.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Tight Regret Bounds for Noisy Optimization of a Brownian Motion [118.65407541895851]
We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise.
We show that as the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $Omega(sigmasqrtT cdot log T)$.
arXiv Detail & Related papers (2020-01-25T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.