Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent
- URL: http://arxiv.org/abs/2106.06749v1
- Date: Sat, 12 Jun 2021 11:28:58 GMT
- Title: Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent
- Authors: Kun Zeng, Jinlan Liu, Zhixia Jiang, Dongpo Xu
- Abstract summary: We propose DSTAda, a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda converges faster and achieves higher accuracy, with better stability and robustness.
- Score: 1.7874193862154875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, researchers have proposed the adaptive gradient descent algorithm
and its variants, such as AdaGrad, RMSProp, Adam, AMSGrad, etc. Although these
algorithms converge faster in the early stage of training, their generalization
ability in the later stage is often not as good as that of stochastic gradient
descent. Recently, some researchers have combined adaptive gradient descent
and stochastic gradient descent to obtain the advantages of both and achieved
good results. Building on this research, we propose a decreasing scaling
transition from adaptive gradient descent to stochastic gradient descent
method (DSTAda). For the stochastic gradient descent stage of training, we
use a learning rate that decreases linearly with the number of iterations
instead of a constant learning rate. We achieve a smooth and stable transition
from adaptive gradient descent to stochastic gradient descent through scaling.
At the same time, we give a theoretical proof of the convergence of DSTAda
under the framework of online learning. Our experimental results show that the
DSTAda algorithm has a faster convergence speed, higher accuracy, and better
stability and robustness. Our implementation is available at:
https://github.com/kunzeng/DSTAdam.
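To make the idea above concrete, the following is a minimal NumPy sketch of one update step that blends an Adam-like adaptive step with a plain SGD step whose learning rate decreases linearly with the iteration count. The function name dst_like_step, the exponential decay gamma**t used for the scaling factor, and the hyperparameter defaults are illustrative assumptions for this sketch, not the authors' exact DSTAdam update rule, which is given in the paper and the repository above.

```python
# Sketch only: a decreasing-scaling blend of an Adam-like step and a plain SGD
# step with a linearly decaying learning rate. The decay schedule and constants
# below are assumptions, not the DSTAdam update from the paper.
import numpy as np

def dst_like_step(theta, grad, state, t, T,
                  lr0=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, gamma=0.999):
    """One step interpolating between an adaptive (Adam-like) and an SGD update.

    theta : current parameters (float ndarray)
    grad  : gradient at theta
    state : dict holding the running moment estimates 'm' and 'v'
    t     : current iteration (1-based), T : total number of iterations
    gamma : decay base for the scaling factor lambda_t = gamma**t (assumed)
    """
    m = state.get("m", np.zeros_like(theta))
    v = state.get("v", np.zeros_like(theta))

    # Adam-style first/second moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    adaptive_step = m_hat / (np.sqrt(v_hat) + eps)   # adaptive direction
    sgd_lr = lr0 * (1 - t / T)                       # linearly decaying SGD rate
    lam = gamma ** t                                 # decreasing scaling factor

    # Early on (lam close to 1) the adaptive term dominates; later the update
    # smoothly hands over to plain SGD with a shrinking step size.
    update = lam * lr0 * adaptive_step + (1 - lam) * sgd_lr * grad

    state["m"], state["v"] = m, v
    return theta - update

# Toy usage: minimize f(x) = ||x||^2 with the sketched update.
theta, state, T = np.array([1.0, -2.0]), {}, 1000
for t in range(1, T + 1):
    grad = 2 * theta            # gradient of ||x||^2
    theta = dst_like_step(theta, grad, state, t, T)
```

On a toy quadratic like this, the early iterations are dominated by the adaptive term and the later ones by plain SGD with a shrinking step size, which is the transition behaviour the abstract describes.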
Related papers
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient based adaption.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual gradient descent with averaging or the adaptive gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum is effective on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent [1.7874193862154875]
Momentum stochastic gradient descent uses the accumulated gradient as the update direction of the current parameters,
whereas plain stochastic gradient descent uses the current gradient without correction by the accumulated gradient.
The TSGD algorithm has faster training speed, higher accuracy and better stability.
arXiv Detail & Related papers (2021-06-12T11:42:04Z) - Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
arXiv Detail & Related papers (2020-10-09T20:22:29Z) - Neural gradients are near-lognormal: improved quantized and sparse
training [35.28451407313548]
We find that the distribution of neural gradients is approximately lognormal.
We suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients.
To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity -- in each case without accuracy degradation.
arXiv Detail & Related papers (2020-06-15T07:00:15Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.