Robustness to Unbounded Smoothness of Generalized SignSGD
- URL: http://arxiv.org/abs/2208.11195v1
- Date: Tue, 23 Aug 2022 21:11:19 GMT
- Title: Robustness to Unbounded Smoothness of Generalized SignSGD
- Authors: Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, Zhenxun
Zhuang
- Abstract summary: We show that momentum plays a critical role in analyzing SignSGD-type and Adam-type algorithms.
We compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating the others.
- Score: 25.07411035728305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional analyses in non-convex optimization typically rely on the
smoothness assumption, namely requiring the gradients to be Lipschitz. However,
recent evidence shows that this smoothness condition does not capture the
properties of some deep learning objective functions, including the ones
involving Recurrent Neural Networks and LSTMs. Instead, they satisfy a much
more relaxed condition, with potentially unbounded smoothness. Under this
relaxed assumption, it has been theoretically and empirically shown that the
gradient-clipped SGD has an advantage over the vanilla one. In this paper, we
show that clipping is not indispensable for Adam-type algorithms in tackling
such scenarios: we theoretically prove that a generalized SignSGD algorithm can
obtain similar convergence rates as SGD with clipping but does not need
explicit clipping at all. This family of algorithms on one end recovers SignSGD
and on the other end closely resembles the popular Adam algorithm. Our analysis
underlines the critical role that momentum plays in analyzing SignSGD-type and
Adam-type algorithms: it not only reduces the effects of noise, thus removing
the need for large mini-batch in previous analyses of SignSGD-type algorithms,
but it also substantially reduces the effects of unbounded smoothness and
gradient norms. We also compare these algorithms with popular optimizers on a
set of deep learning tasks, observing that we can match the performance of Adam
while beating the others.
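To make the two endpoints named in the abstract concrete, below is a minimal sketch, assuming plain NumPy and a synthetic quadratic toy objective: SignSGD with momentum on one end and an Adam-like normalized-momentum update on the other. It illustrates the general shape of such updates, not the authors' exact pseudocode; the toy loss, hyperparameter names, and values are assumptions chosen for illustration only.
```python
# Minimal sketch of the two endpoints of a SignSGD/Adam-like family
# (illustrative only; not the paper's algorithm or pseudocode).
import numpy as np

rng = np.random.default_rng(0)


def noisy_grad(x):
    """Gradient of a toy objective f(x) = 0.5 * ||x||^2, plus Gaussian noise."""
    return x + 0.1 * rng.standard_normal(x.shape)


def signsgd_momentum(x0, lr=0.01, beta=0.9, steps=500):
    """SignSGD endpoint: step in the coordinate-wise sign of the momentum."""
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = noisy_grad(x)
        m = beta * m + (1 - beta) * g   # momentum averages out gradient noise
        x -= lr * np.sign(m)            # sign update: per-coordinate step is
                                        # bounded without explicit clipping
    return x


def adam_like(x0, lr=0.01, beta=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam-like endpoint: momentum normalized by a second-moment estimate."""
    x, m, v = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    for _ in range(steps):
        g = noisy_grad(x)
        m = beta * m + (1 - beta) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        x -= lr * m / (np.sqrt(v) + eps)  # normalization again caps the
                                          # effective step size, clipping-free
    return x


x0 = 5.0 * np.ones(10)
print(np.linalg.norm(signsgd_momentum(x0)), np.linalg.norm(adam_like(x0)))
```
In both variants the per-coordinate step is bounded regardless of how large the raw gradient is, which is the clipping-free behavior the abstract attributes to this family under unbounded smoothness.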
Related papers
- Random Scaling and Momentum for Non-smooth Non-convex Optimization [38.443430569753026]
Training neural networks requires a loss function that may be highly irregular, and in particular neither convex nor smooth.
Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which analysis applies only if the loss is either convex or smooth.
arXiv Detail & Related papers (2024-05-16T00:52:03Z)
- On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions [4.9495085874952895]
The Adaptive Moment Estimation (Adam) algorithm is highly effective in various deep learning tasks.
We show that Adam can find a stationary point with high probability under this general noise model.
arXiv Detail & Related papers (2024-02-06T13:19:26Z)
- Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how such a sign-based update algorithm influences step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign.
The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments.
arXiv Detail & Related papers (2023-12-03T02:26:58Z)
- Adaptive Strategies in Non-convex Optimization [5.279475826661643]
An algorithm is said to be adaptive to a certain parameter if it does not need a priori knowledge of such a parameter.
This dissertation presents our work on adaptive algorithms in three scenarios.
arXiv Detail & Related papers (2023-06-17T06:52:05Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization [94.06564567766475]
DP-SGD and DP-NSGD mitigate the risk of large models memorizing sensitive training data.
We show that these two algorithms achieve similar best accuracy while DP-NSGD is comparatively easier to tune than DP-SGD.
arXiv Detail & Related papers (2022-06-27T03:45:02Z)
- Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent [0.0]
The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties.
We show that noise classes that have the same mean and covariance structure as SGD noise obtained via minibatching have similar properties.
We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
arXiv Detail & Related papers (2021-12-14T02:25:43Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
- Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
- Convergence of Online Adaptive and Recurrent Optimization Algorithms [0.0]
We prove local convergence of several notable descent algorithms used in machine learning.
We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions.
arXiv Detail & Related papers (2020-05-12T09:48:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.