Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent
- URL: http://arxiv.org/abs/2106.06749v1
- Date: Sat, 12 Jun 2021 11:28:58 GMT
- Title: Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent
- Authors: Kun Zeng, Jinlan Liu, Zhixia Jiang, Dongpo Xu
- Abstract summary: We propose DSTAda, a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda converges faster and achieves higher accuracy, with better stability and robustness.
- Score: 1.7874193862154875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, researchers have proposed the adaptive gradient descent algorithm
and its variants, such as AdaGrad, RMSProp, Adam, AMSGrad, etc. Although these
algorithms converge faster in the early stage of training, their generalization
ability in the later stage is often not as good as that of stochastic gradient
descent. Recently, some researchers have combined adaptive gradient descent
and stochastic gradient descent to obtain the advantages of both and achieved
good results. Building on this research, we propose a decreasing scaling
transition from adaptive gradient descent to stochastic gradient descent
method (DSTAda). For the stochastic gradient descent stage of training, we
use a learning rate that decreases linearly with the number of iterations
instead of a constant learning rate. We achieve a smooth and stable transition
from adaptive gradient descent to stochastic gradient descent through scaling.
At the same time, we give a theoretical proof of the convergence of DSTAda
under the framework of online learning. Our experimental results show that the
DSTAda algorithm has a faster convergence speed, higher accuracy, and better
stability and robustness. Our implementation is available at:
https://github.com/kunzeng/DSTAdam.
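To make the idea above concrete, the following is a minimal NumPy sketch of one update step that blends an Adam-like adaptive step with a plain SGD step whose learning rate decreases linearly with the iteration count. The function name dst_like_step, the exponential decay gamma**t used for the scaling factor, and the hyperparameter defaults are illustrative assumptions for this sketch, not the authors' exact DSTAdam update rule, which is given in the paper and the repository above.

```python
# Sketch only: a decreasing-scaling blend of an Adam-like step and a plain SGD
# step with a linearly decaying learning rate. The decay schedule and constants
# below are assumptions, not the DSTAdam update from the paper.
import numpy as np

def dst_like_step(theta, grad, state, t, T,
                  lr0=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, gamma=0.999):
    """One step interpolating between an adaptive (Adam-like) and an SGD update.

    theta : current parameters (float ndarray)
    grad  : gradient at theta
    state : dict holding the running moment estimates 'm' and 'v'
    t     : current iteration (1-based), T : total number of iterations
    gamma : decay base for the scaling factor lambda_t = gamma**t (assumed)
    """
    m = state.get("m", np.zeros_like(theta))
    v = state.get("v", np.zeros_like(theta))

    # Adam-style first/second moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    adaptive_step = m_hat / (np.sqrt(v_hat) + eps)   # adaptive direction
    sgd_lr = lr0 * (1 - t / T)                       # linearly decaying SGD rate
    lam = gamma ** t                                 # decreasing scaling factor

    # Early on (lam close to 1) the adaptive term dominates; later the update
    # smoothly hands over to plain SGD with a shrinking step size.
    update = lam * lr0 * adaptive_step + (1 - lam) * sgd_lr * grad

    state["m"], state["v"] = m, v
    return theta - update

# Toy usage: minimize f(x) = ||x||^2 with the sketched update.
theta, state, T = np.array([1.0, -2.0]), {}, 1000
for t in range(1, T + 1):
    grad = 2 * theta            # gradient of ||x||^2
    theta = dst_like_step(theta, grad, state, t, T)
```

On a toy quadratic like this, the early iterations are dominated by the adaptive term and the later ones by plain SGD with a shrinking step size, which is the transition behaviour the abstract describes.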
Related papers
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient based adaption.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual gradient descent with averaging or the adaptive gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum is effective on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent [1.7874193862154875]
Momentum stochastic gradient descent uses the accumulated gradient as the update direction of the current parameters,
whereas plain stochastic gradient descent uses the current gradient without correction by the accumulated gradient.
The TSGD algorithm has faster training speed, higher accuracy and better stability.
arXiv Detail & Related papers (2021-06-12T11:42:04Z) - Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
arXiv Detail & Related papers (2020-10-09T20:22:29Z) - Neural gradients are near-lognormal: improved quantized and sparse
training [35.28451407313548]
We find that the distribution of neural gradients is approximately lognormal.
We suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients.
To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity -- in each case without accuracy degradation.
arXiv Detail & Related papers (2020-06-15T07:00:15Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.