Robustness to Unbounded Smoothness of Generalized SignSGD
- URL: http://arxiv.org/abs/2208.11195v1
- Date: Tue, 23 Aug 2022 21:11:19 GMT
- Title: Robustness to Unbounded Smoothness of Generalized SignSGD
- Authors: Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, Zhenxun
Zhuang
- Abstract summary: We show that momentum plays a critical role in analyzing SignSGD-type and Adam-type algorithms.
We compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating the others.
- Score: 25.07411035728305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional analyses in non-convex optimization typically rely on the
smoothness assumption, namely requiring the gradients to be Lipschitz. However,
recent evidence shows that this smoothness condition does not capture the
properties of some deep learning objective functions, including the ones
involving Recurrent Neural Networks and LSTMs. Instead, they satisfy a much
more relaxed condition, with potentially unbounded smoothness. Under this
relaxed assumption, it has been theoretically and empirically shown that the
gradient-clipped SGD has an advantage over the vanilla one. In this paper, we
show that clipping is not indispensable for Adam-type algorithms in tackling
such scenarios: we theoretically prove that a generalized SignSGD algorithm can
obtain similar convergence rates as SGD with clipping but does not need
explicit clipping at all. This family of algorithms on one end recovers SignSGD
and on the other end closely resembles the popular Adam algorithm. Our analysis
underlines the critical role that momentum plays in analyzing SignSGD-type and
Adam-type algorithms: it not only reduces the effects of noise, thus removing
the need for large mini-batch in previous analyses of SignSGD-type algorithms,
but it also substantially reduces the effects of unbounded smoothness and
gradient norms. We also compare these algorithms with popular optimizers on a
set of deep learning tasks, observing that we can match the performance of Adam
while beating the others.
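To make the two endpoints named in the abstract concrete, below is a minimal sketch, assuming plain NumPy and a synthetic quadratic toy objective: SignSGD with momentum on one end and an Adam-like normalized-momentum update on the other. It illustrates the general shape of such updates, not the authors' exact pseudocode; the toy loss, hyperparameter names, and values are assumptions chosen for illustration only.
```python
# Minimal sketch of the two endpoints of a SignSGD/Adam-like family
# (illustrative only; not the paper's algorithm or pseudocode).
import numpy as np

rng = np.random.default_rng(0)


def noisy_grad(x):
    """Gradient of a toy objective f(x) = 0.5 * ||x||^2, plus Gaussian noise."""
    return x + 0.1 * rng.standard_normal(x.shape)


def signsgd_momentum(x0, lr=0.01, beta=0.9, steps=500):
    """SignSGD endpoint: step in the coordinate-wise sign of the momentum."""
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = noisy_grad(x)
        m = beta * m + (1 - beta) * g   # momentum averages out gradient noise
        x -= lr * np.sign(m)            # sign update: per-coordinate step is
                                        # bounded without explicit clipping
    return x


def adam_like(x0, lr=0.01, beta=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam-like endpoint: momentum normalized by a second-moment estimate."""
    x, m, v = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    for _ in range(steps):
        g = noisy_grad(x)
        m = beta * m + (1 - beta) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        x -= lr * m / (np.sqrt(v) + eps)  # normalization again caps the
                                          # effective step size, clipping-free
    return x


x0 = 5.0 * np.ones(10)
print(np.linalg.norm(signsgd_momentum(x0)), np.linalg.norm(adam_like(x0)))
```
In both variants the per-coordinate step is bounded regardless of how large the raw gradient is, which is the clipping-free behavior the abstract attributes to this family under unbounded smoothness.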
Related papers
- Random Scaling and Momentum for Non-smooth Non-convex Optimization [38.443430569753026]
Training neural networks requires a loss function that may be highly irregular, and in particular neither convex nor smooth.
Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which analysis applies only if the loss is either convex or smooth.
arXiv Detail & Related papers (2024-05-16T00:52:03Z)
- On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions [4.9495085874952895]
The Adaptive Moment Estimation (Adam) algorithm is highly effective in various deep learning tasks.
We show that Adam can find a stationary point with high probability under this general noise model.
arXiv Detail & Related papers (2024-02-06T13:19:26Z)
- Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how such a sign-based update algorithm influences step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign.
The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments.
arXiv Detail & Related papers (2023-12-03T02:26:58Z)
- Adaptive Strategies in Non-convex Optimization [5.279475826661643]
An algorithm is said to be adaptive to a certain parameter if it does not need a priori knowledge of such a parameter.
This dissertation presents our work on adaptive algorithms in three scenarios.
arXiv Detail & Related papers (2023-06-17T06:52:05Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization [94.06564567766475]
DP-SGD and DP-NSGD mitigate the risk of large models memorizing sensitive training data.
We show that these two algorithms achieve similar best accuracy while DP-NSGD is comparatively easier to tune than DP-SGD.
arXiv Detail & Related papers (2022-06-27T03:45:02Z)
- Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent [0.0]
The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties.
We show that noise classes that have the same mean and covariance structure as SGD noise obtained via minibatching have similar properties.
We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
arXiv Detail & Related papers (2021-12-14T02:25:43Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
- Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
- Convergence of Online Adaptive and Recurrent Optimization Algorithms [0.0]
We prove local convergence of several notable descent algorithms used in machine learning.
We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions.
arXiv Detail & Related papers (2020-05-12T09:48:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.