Related papers: Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results

Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results

URL: http://arxiv.org/abs/2410.16561v1
Date: Mon, 21 Oct 2024 22:40:42 GMT
Title: Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results
Authors: Tao Sun, Xinwang Liu, Kun Yuan,
Abstract summary: This paper investigates Gradient Normalization without (NSGDC) its gradient reduction variant (NSGDC-VR) We present significant improvements in the theoretical results for both algorithms.
Score: 60.92029979853314
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: This paper investigates Gradient Normalization Stochastic Gradient Descent without Clipping (NSGDC) and its variance reduction variant (NSGDC-VR) for nonconvex optimization under heavy-tailed noise. We present significant improvements in the theoretical results for both algorithms, including the removal of logarithmic factors from the convergence rates and the recovery of the convergence rate to match the deterministic case when the noise variance {\sigma} is zero. Additionally, we demonstrate that gradient normalization alone, assuming individual Lipschitz smoothness, is sufficient to ensure convergence of SGD under heavy-tailed noise, eliminating the need for gradient clipping. Furthermore, we introduce accelerated nonconvex algorithms that utilize second-order Lipschitz smoothness to achieve enhanced convergence rates in the presence of heavy-tailed noise. Our findings offer a deeper understanding of how gradient normalization and variance reduction techniques can be optimized for robust performance in challenging optimization scenarios.

Related papers

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions. Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z)
Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed gradient descent (SGD) with compressed gradient communication in the parameter-server framework. Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
Signal Processing Meets SGD: From Momentum to Filter [6.751292200515355]
In deep learning, gradient descent (SGD) and its momentum-based variants are widely used for optimization.<n>In this paper, we analyze gradient behavior through a signal processing lens, isolating key factors that influence updates.<n>We introduce a novel method SGDF based on Wiener Filter principles, which derives an optimal time-varying gain to refine updates.
arXiv Detail & Related papers (2023-11-06T01:41:46Z)
Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the Bounded Gradient Assumption [11.367487348673793]
We prove that various gradient descent methods, including the gradient descent (SGD), heavy-ball (SHB) and Nesterov's accelerated gradient (SNAG) methods, almost surely avoid any strict saddle manifold. This is the first time such results are obtained for SHB and SNAG methods.
arXiv Detail & Related papers (2023-02-15T18:53:41Z)
The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance [46.15915820243487]
We show that AdaGrad-Norm exhibits an order optimal convergence of $mathcalOleft. We show that AdaGrad-Norm exhibits an order optimal convergence of $mathcalOleft.
arXiv Detail & Related papers (2022-02-11T17:37:54Z)
Improving Differentially Private SGD via Randomly Sparsified Gradients [31.295035726077366]
Differentially private gradient observation (DP-SGD) has been widely adopted in deep learning to provide rigorously defined privacy bound compression. We propose an and utilize RS to strengthen communication cost and strengthen privacy bound compression.
arXiv Detail & Related papers (2021-12-01T21:43:34Z)
On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD) We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting. We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective. We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations [4.991328448898387]
We propose AdaL, with a transformation on the original gradient. AdaL accelerates the convergence by amplifying the gradient in the early stage, as well as dampens the oscillation and stabilizes the optimization by shrinking the gradient later.
arXiv Detail & Related papers (2021-07-04T02:55:36Z)
Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients. We prove new complexity that outperform state-of-the-art results in this case. We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.