Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent
- URL: http://arxiv.org/abs/2106.06753v1
- Date: Sat, 12 Jun 2021 11:42:04 GMT
- Title: Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent
- Authors: Kun Zeng, Jinlan Liu, Zhixia Jiang, Dongpo Xu
- Abstract summary: Momentum stochastic gradient descent uses the accumulated gradient as the update direction for the current parameters, whereas the direction of plain stochastic gradient descent is not corrected by the accumulated gradient.
The proposed TSGD algorithm has faster training speed, higher accuracy and better stability.
- Score: 1.7874193862154875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The plain stochastic gradient descent and momentum stochastic gradient
descent have extremely wide applications in deep learning due to their simple
settings and low computational complexity. Momentum stochastic gradient
descent uses the accumulated gradient as the update direction for the current
parameters, which yields faster training. The direction of plain stochastic
gradient descent is not corrected by the accumulated gradient; for the
parameters currently being updated it is the locally optimal direction, so its
updates are more accurate. We combine the advantages
of the momentum stochastic gradient descent with fast training speed and the
plain stochastic gradient descent with high accuracy, and propose a scaling
transition from momentum stochastic gradient descent to plain stochastic
gradient descent (TSGD) method. In addition, a learning rate that decreases
linearly with the iterations is used instead of a constant learning rate. The
TSGD algorithm uses a larger step size in the early stage to speed up
training and a smaller step size in the later stage so that training converges
steadily. Our experimental results show that the TSGD algorithm achieves faster
training speed, higher accuracy and better stability. Our implementation is
available at: https://github.com/kunzeng/TSGD.
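The abstract describes the mechanism but not the exact formulas. A minimal sketch of one TSGD-style step in plain Python, where the linear blending weight `rho`, the learning-rate endpoints, and the momentum coefficient `beta` are illustrative assumptions rather than the paper's exact schedule (the real implementation is in the linked repository):

```python
def tsgd_step(w, grad, buf, t, T, lr0=0.1, lr_min=0.001, beta=0.9):
    """One TSGD-style update on a parameter vector w (plain Python lists).

    Early in training (small t) the update direction is dominated by the
    momentum buffer; late in training it transitions to the raw gradient.
    The linear blend rho and the linearly decreasing learning rate are
    illustrative assumptions, not the paper's exact schedule.
    """
    rho = max(0.0, 1.0 - t / T)            # scaling weight: 1 -> 0 over training
    lr = lr_min + (lr0 - lr_min) * rho     # linearly decreasing step size
    new_buf = [beta * b + g for b, g in zip(buf, grad)]  # momentum buffer
    # blended direction: momentum-dominated early, plain gradient late
    direction = [rho * b + (1.0 - rho) * g for b, g in zip(new_buf, grad)]
    new_w = [x - lr * d for x, d in zip(w, direction)]
    return new_w, new_buf
```

On a simple quadratic objective this converges: the momentum phase takes large steps early, and the plain-SGD phase with a small step size settles near the minimum.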
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
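To illustrate the general idea of one-bit gradient compression (this is a generic sign-based scheme, not FO-SGD's exact mechanism, which additionally uses the two algorithmic ideas described in the paper):

```python
def one_bit_compress(grad):
    """Generic sign-based one-bit compression (illustrative, not FO-SGD's
    exact scheme): each coordinate is reduced to its sign, plus one shared
    scale chosen so the reconstruction preserves the mean magnitude."""
    scale = sum(abs(g) for g in grad) / len(grad)  # mean magnitude
    signs = [1.0 if g >= 0 else -1.0 for g in grad]
    return scale, signs

def one_bit_decompress(scale, signs):
    """Reconstruct an approximate gradient from the scale and sign bits."""
    return [scale * s for s in signs]
```

Each worker then sends one float (the scale) plus one bit per coordinate instead of a full-precision vector, and the reconstruction preserves every coordinate's sign and the vector's l1 norm.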
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware
Quantization Training [12.400950982075948]
Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources.
Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient.
This paper proposes a one-step forward and backtrack way for loss-aware quantization to get more accurate and stable gradient direction.
arXiv Detail & Related papers (2024-01-30T05:42:54Z) - One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual gradient descent with averaging or the adaptative gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence.
Nesterov acceleration converges faster than heavy-ball acceleration in theory and in many empirical cases.
In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point.
We show that the resulting Adan optimizer surpasses the corresponding SoTA optimizers on vision tasks (ViTs and CNNs) and sets new SoTAs for many popular networks.
arXiv Detail & Related papers (2022-08-13T16:04:39Z) - On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
The proposed AdaMomentum optimizer performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda has a faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z) - SSGD: A safe and efficient method of gradient descent [0.5099811144731619]
The gradient descent method plays an important role in solving various optimization problems.
We propose a super gradient descent approach that updates parameters by concealing the length of the gradient.
Our algorithm can defend against attacks on the gradient.
arXiv Detail & Related papers (2020-12-03T17:09:20Z) - Anderson acceleration of coordinate descent [5.794599007795348]
On multiple Machine Learning problems, coordinate descent achieves performance significantly superior to full-gradient methods.
We propose an accelerated version of coordinate descent using extrapolation, showing considerable speed up in practice.
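The extrapolation idea can be illustrated with the simplest two-point form of Anderson acceleration for a fixed-point map (a sketch of the general technique, not the paper's full method, which keeps a longer history of iterates):

```python
def anderson_extrapolate(x0, x1, f0, f1):
    """Two-point Anderson extrapolation for a fixed-point map F (sketch).
    Given iterates x0, x1 and their images f0 = F(x0), f1 = F(x1), choose
    the affine combination of residuals with minimal norm and return the
    corresponding combination of the mapped points."""
    r0 = [a - b for a, b in zip(f0, x0)]  # residual at x0
    r1 = [a - b for a, b in zip(f1, x1)]  # residual at x1
    d = [a - b for a, b in zip(r1, r0)]
    denom = sum(di * di for di in d)
    # least-squares weight minimizing ||(1 - alpha) * r0 + alpha * r1||
    alpha = 0.0 if denom == 0.0 else -sum(a * b for a, b in zip(r0, d)) / denom
    return [(1 - alpha) * a + alpha * b for a, b in zip(f0, f1)]
```

For an affine contraction the residual is linear in x, so this two-point extrapolation lands exactly on the fixed point in one step, which is the source of the speed-up in practice.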
arXiv Detail & Related papers (2020-11-19T19:01:48Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the OptimisticOA algorithm for nonconcave min-max problems.
Our experiments show that the difference between adaptive and non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.