Related papers: Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise

Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise

URL: http://arxiv.org/abs/2601.19730v1
Date: Tue, 27 Jan 2026 15:50:57 GMT
Title: Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise
Authors: Hongxu Chen, Ke Wei, Xiaoming Yuan, Luo Luo,
Abstract summary: We develop a general framework for establishing bounds under heavy-tailed noise.<n>We provide the stability and generalization analysis for several popular algorithms under heavy-tailed noise.
Score: 24.27538723361077
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The empirical evidence indicates that stochastic optimization with heavy-tailed gradient noise is more appropriate to characterize the training of machine learning models than that with standard bounded gradient variance noise. Most existing works on this phenomenon focus on the convergence of optimization errors, while the analysis for generalization bounds under the heavy-tailed gradient noise remains limited. In this paper, we develop a general framework for establishing generalization bounds under heavy-tailed noise. Specifically, we introduce a truncation argument to achieve the generalization error bound based on the algorithmic stability under the assumption of bounded $p$th centered moment with $p\in(1,2]$. Building on this framework, we further provide the stability and generalization analysis for several popular stochastic algorithms under heavy-tailed noise, including clipped and normalized stochastic gradient descent, as well as their mini-batch and momentum variants.

Related papers

Accelerated stochastic first-order method for convex optimization under heavy-tailed noise [3.5877352183559386]
We study convex composite optimization problems, where the objective function is given by the sum of a prox-friendly function and a convex function whose subgradients are estimated under heavy-tailed noise.<n>We demonstrate that a vanilla algorithm can achieve optimal complexity for these problems without additional modifications such as clipping or normalization.
arXiv Detail & Related papers (2025-10-13T17:45:05Z)
Nonconvex Decentralized Stochastic Bilevel Optimization under Heavy-Tailed Noises [27.097405340259325]
Existing decentralized optimization methods assume the lower-level loss function is strongly convex and the gradient noise has finite variance.<n>This is the first decentralized bilevel algorithm with with rigorous theoretical guarantees under heavy-tailed noises.
arXiv Detail & Related papers (2025-09-19T02:51:19Z)
Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Gradient Descent (SGD) under heavy-tailed noise. Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise. We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped optimization problems with heavy-tailed noise with structured density. We show that it is possible to get faster rates of convergence than $mathcalO(K-(alpha - 1)/alpha)$, when the gradients have finite moments of order. We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z)
Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
An analysis of convex and non- optimization methods often requires the Lipsitzness gradient, which limits the analysis by this trajectorys. Recent work generalizes the gradient setting via the non-uniform smoothness condition.
arXiv Detail & Related papers (2023-06-02T04:21:59Z)
First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities [91.46841922915418]
We present a unified approach for the theoretical analysis of first-order variation methods. Our approach covers both non-linear gradient and strongly Monte Carlo problems. We provide bounds that match the oracle strongly in the case of convex method optimization problems.
arXiv Detail & Related papers (2023-05-25T11:11:31Z)
SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance [33.593203156666746]
We show that AdaGrad stepsizes a popular adaptive (self-tuning) method for first-order optimization. In both the low-noise and high-regimes we find sharp rates of convergence in both the low-noise and high-regimes.
arXiv Detail & Related papers (2023-02-17T09:46:08Z)
Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms [8.669461942767098]
We study momentum-based first-order optimization algorithms in which the iterations are subject to an additive white noise. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification. We introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time.
arXiv Detail & Related papers (2022-09-24T04:26:30Z)
Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for methods for solving monotone and structured non-monotone VIPs. Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems. In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD. We prove that with constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients. We prove new complexity that outperform state-of-the-art results in this case. We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.