Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization
- URL: http://arxiv.org/abs/2103.17182v1
- Date: Wed, 31 Mar 2021 16:08:06 GMT
- Title: Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization
- Authors: Zeke Xie, Li Yuan, Zhanxing Zhu, and Masashi Sugiyama
- Abstract summary: Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
- Score: 89.7882166459412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well-known that stochastic gradient noise (SGN) acts as implicit
regularization for deep learning and is essentially important for both
optimization and generalization of deep networks. Some works attempted to
artificially simulate SGN by injecting random noise to improve deep learning.
However, it turned out that such simple injected random noise cannot work as
well as SGN, which is anisotropic and parameter-dependent. For simulating SGN
at low computational costs and without changing the learning rate or batch
size, we propose the Positive-Negative Momentum (PNM) approach that is a
powerful alternative to conventional Momentum in classic optimizers. The
introduced PNM method maintains two approximately independent momentum terms.
Then, we can control the magnitude of SGN explicitly by adjusting the momentum
difference. We theoretically prove the convergence guarantee and the
generalization advantage of PNM over Stochastic Gradient Descent (SGD). By
incorporating PNM into the two conventional optimizers, SGD with Momentum and
Adam, our extensive experiments empirically verified the significant advantage
of the PNM-based variants over the corresponding conventional Momentum-based
optimizers. Code: \url{https://github.com/zeke-xie/Positive-Negative-Momentum}.
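As a reading aid only, the mechanism described in the abstract can be sketched in a few lines of NumPy. This is not the authors' implementation: the alternating-buffer update, the (1 + beta0) / -beta0 weighting, the normalization constant, and the default hyperparameters are all assumptions of this sketch; the actual PNM-based optimizers live in the linked repository. The point of the construction is that the extra noise comes from the difference between two stochastic momentum estimates of the same gradient, so its magnitude can be tuned through beta0 without touching the learning rate or batch size, unlike injecting isotropic random noise.

```python
import numpy as np

def pnm_step(theta, grad, m_odd, m_even, t, lr=0.1, beta1=0.9, beta0=1.0):
    """One positive-negative-momentum style step (illustrative sketch only).

    Two momentum buffers are updated on alternating iterations, so each buffer
    averages a disjoint half of the minibatch gradients and the two stay
    approximately independent. The step weights the freshly updated buffer by
    (1 + beta0) and subtracts beta0 times the stale one; larger beta0 amplifies
    the disagreement between the buffers, i.e. the injected noise component.
    """
    if t % 2 == 0:
        m_even = beta1 * m_even + (1.0 - beta1) * grad
        m_new, m_old = m_even, m_odd
    else:
        m_odd = beta1 * m_odd + (1.0 - beta1) * grad
        m_new, m_old = m_odd, m_even
    # Keep the step magnitude comparable to plain momentum SGD as beta0 grows
    # (an assumption of this sketch, not necessarily the paper's exact scaling).
    scale = np.sqrt((1.0 + beta0) ** 2 + beta0 ** 2)
    theta = theta - lr * ((1.0 + beta0) * m_new - beta0 * m_old) / scale
    return theta, m_odd, m_even

# Toy usage: noisy quadratic loss 0.5 * ||theta||^2 with per-step gradient noise.
rng = np.random.default_rng(0)
theta, m_odd, m_even = np.ones(10), np.zeros(10), np.zeros(10)
for t in range(200):
    grad = theta + 0.1 * rng.standard_normal(10)  # stand-in for a minibatch gradient
    theta, m_odd, m_even = pnm_step(theta, grad, m_odd, m_even, t)
```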
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates normalized SGD with(out) clipping (NSGDC) and its variance-reduction variant (NSGDC-VR). (A generic normalized/clipped gradient update is sketched after this list.)
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm [47.47843839099175]
A Quantum Natural Gradient (QNG) algorithm for optimization of variational quantum circuits has been proposed recently.
In this study, we employ the Langevin equation with a QNG force to demonstrate that its discrete-time solution gives a generalized form, which we call Momentum-QNG.
arXiv Detail & Related papers (2024-09-03T15:21:16Z) - Exact Gauss-Newton Optimization for Training Deep Neural Networks [0.0]
We present EGN, a second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction.
We show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm.
arXiv Detail & Related papers (2024-05-23T10:21:05Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust, and accelerated stochastic optimizer that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry.
arXiv Detail & Related papers (2021-02-13T21:28:09Z) - Plug-And-Play Learned Gaussian-mixture Approximate Message Passing [71.74028918819046]
We propose a plug-and-play compressed sensing (CS) recovery algorithm suitable for any i.i.d. source prior.
Our algorithm builds upon Borgerding's learned AMP (LAMP), yet significantly improves it by adopting a universal denoising function within the algorithm.
Numerical evaluation shows that the L-GM-AMP algorithm achieves state-of-the-art performance without any knowledge of the source prior.
arXiv Detail & Related papers (2020-11-18T16:40:45Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC
and its applications [5.660384137948734]
We show that the proposed algorithm can asymptotically converge to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z) - On the Promise of the Stochastic Generalized Gauss-Newton Method for
Training DNNs [37.96456928567548]
We study a stochastic generalized Gauss-Newton method (SGN) for training DNNs.
SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge.
We show that SGN does not only substantially improve over SGD in terms of the number of iterations, but also in terms of runtime.
This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform.
arXiv Detail & Related papers (2020-06-03T17:35:54Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity bounds that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z)
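Several entries above (the NSGDC paper and clipped-SSTM) build on gradient normalization and gradient clipping as their base operations. The papers' actual algorithms add acceleration, variance reduction, and tailored step-size rules; the snippet below is only a generic NumPy sketch of the two underlying updates, with illustrative function names, thresholds, and step sizes that are not taken from either paper.

```python
import numpy as np

def clipped_sgd_step(theta, grad, lr=0.01, clip_threshold=1.0):
    """Gradient clipping: rescale the gradient when its norm exceeds a
    threshold, bounding the influence of heavy-tailed noise on any one step."""
    norm = np.linalg.norm(grad)
    if norm > clip_threshold:
        grad = grad * (clip_threshold / norm)
    return theta - lr * grad

def normalized_sgd_step(theta, grad, lr=0.01, eps=1e-8):
    """Gradient normalization: step along the unit gradient direction, so the
    step length is decoupled from the (possibly heavy-tailed) gradient norm."""
    return theta - lr * grad / (np.linalg.norm(grad) + eps)
```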
This list is automatically generated from the titles and abstracts of the papers on this site.