Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization
- URL: http://arxiv.org/abs/2507.09093v1
- Date: Sat, 12 Jul 2025 00:31:13 GMT
- Title: Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization
- Authors: Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, Soummya Kar
- Abstract summary: We propose two novel estimators based on the idea of noise symmetrization. Compared to works assuming symmetric noise with unbounded moments, we provide a sharper analysis and improved rates.
- Score: 50.49466204159458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study high-probability convergence of SGD-type methods for non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$ for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at a reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we get the N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and a PDF satisfying a mild technical condition; N-MSGE additionally requires a bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to that of existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
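The abstract describes a concrete recipe: pass a (possibly symmetrized) stochastic gradient through a bounded nonlinearity before taking the SGD step. The sketch below illustrates that recipe on a toy problem. It is not the authors' implementation: the exact SGE form (differencing two independent noise draws so the effective noise becomes symmetric, then re-centering with the noiseless reference gradient) is an assumption reconstructed from the abstract's description, and the clipping radius, step-size schedule, and centered Pareto noise model are illustrative choices.

```python
# Minimal sketch of N-SGD / N-SGE under the assumptions stated above;
# not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def clip(g, lam=1.0):
    """Clipping nonlinearity from the black-box framework; sign and
    normalization are analogous bounded-output choices."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def grad(x):
    """Noiseless gradient of a toy quadratic f(x) = 0.5 * ||x||^2."""
    return x

def stoch_grad(x):
    """Gradient plus heavy-tailed, NON-symmetric noise: a centered Pareto
    draw with tail index 1.5 (zero mean, infinite variance, skewed PDF)."""
    noise = rng.pareto(1.5, size=x.shape) - 2.0  # Pareto(1.5) has mean 2
    return grad(x) + noise

def sge(x, x_ref, grad_ref):
    """Assumed Symmetrized Gradient Estimator (SGE):
        g(x) - g~(x_ref) + grad_f(x_ref) = grad_f(x) + (n - n~),
    where n, n~ are i.i.d. noise draws, so n - n~ has a symmetric
    distribution even though each draw does not."""
    return stoch_grad(x) - stoch_grad(x_ref) + grad_ref

# N-SGE: nonlinear SGD applied on top of the symmetrized estimator.
x = rng.normal(size=5)
x_ref = np.zeros(5)
grad_ref = grad(x_ref)          # noiseless gradient at the reference point
for t in range(1, 2001):
    eta = 0.5 / np.sqrt(t)      # schedule consistent with the t^{-1/2} rate
    x = x - eta * clip(sge(x, x_ref, grad_ref))

print("squared gradient norm after 2000 steps:", np.linalg.norm(grad(x)) ** 2)
```

N-MSGE would replace `grad_ref` with a mini-batch average of `stoch_grad(x_ref)` computed once at the start of training, which is why, per the abstract, it additionally needs a bounded noise moment of order $p \in (1,2]$.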
Related papers
- Stochastic Weakly Convex Optimization Under Heavy-Tailed Noises [55.43924214633558]
In this paper, we focus on two types of noise: sub-Weibull noise and SsBC noise. Under these two noise assumptions, the in-expectation and high-probability convergence of SFOMs is studied in the contexts of convex optimization and smooth optimization.
arXiv Detail & Related papers (2025-07-17T16:48:45Z)
- Large Deviation Upper Bounds and Improved MSE Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry [47.653744900375855]
We provide a framework for the analysis of nonlinear stochastic gradient methods for convex costs in the presence of heavy-tailed noise. We show an explicit dependence on the choice of step-size, nonlinearity, noise and problem parameters. We provide an MSE rate of $\widetilde{\mathcal{O}}(t^{-1})$ that can be made close to the optimal $\mathcal{O}(t^{-1})$.
arXiv Detail & Related papers (2024-10-21T04:50:57Z)
- Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees [56.80920351680438]
We study high-probability convergence in online learning in the presence of heavy-tailed noise. We provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments.
arXiv Detail & Related papers (2024-10-17T18:25:28Z)
- From Gradient Clipping to Normalization for Heavy Tailed SGD [19.369399536643773]
Recent empirical evidence indicates that machine learning applications involve heavy-tailed noise, which challenges the standard assumption of bounded variance in practice. In this paper, we show that it is possible to achieve tight convergence guarantees for normalized SGD under heavy-tailed, gradient-dependent noise.
arXiv Detail & Related papers (2024-10-17T17:59:01Z)
- Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider stochastic optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha-1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z)
- High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We show high-probability convergence guarantees for learning on streaming data in the presence of heavy-tailed noise.
We demonstrate analytically and empirically that our results inform the preferred choice of setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z)
- Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise [11.768495184175052]
We introduce a general framework for nonlinear gradient descent scenarios when gradient noise exhibits heavy tails.
We show that, for a nonlinearity with bounded outputs and gradient noise that may not have finite moments of order greater than one, the nonlinear SGD converges to zero at rate $O(t^{-\zeta})$, $\zeta \in (0,1)$.
Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state-of-the-art alternatives on real data sets.
arXiv Detail & Related papers (2022-04-06T06:05:52Z)