AdaL: Adaptive Gradient Transformation Contributes to Convergences and
Generalizations
- URL: http://arxiv.org/abs/2107.01525v1
- Date: Sun, 4 Jul 2021 02:55:36 GMT
- Title: AdaL: Adaptive Gradient Transformation Contributes to Convergences and
Generalizations
- Authors: Hongwei Zhang and Weidong Zou and Hongbo Zhao and Qi Ming and Tijin
Yan and Yuanqing Xia and Weipeng Cao
- Abstract summary: We propose AdaL, which applies a transformation to the original gradient.
AdaL accelerates convergence by amplifying the gradient in the early stage, and later dampens oscillation and stabilizes the optimization by shrinking the gradient.
- Score: 4.991328448898387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive optimization methods have been widely used in deep learning. They
scale the learning rates adaptively according to the past gradient, which has
been shown to be effective to accelerate the convergence. However, they suffer
from poor generalization performance compared with SGD. Recent studies point
out that smoothing the exponential gradient noise leads to a generalization
degeneration phenomenon. Inspired by this, we propose AdaL, which applies a
transformation to the original gradient. AdaL accelerates convergence by
amplifying the gradient in the early stage, and later dampens oscillation and
stabilizes the optimization by shrinking the gradient. This modification
alleviates the smoothing of the gradient noise, which produces better
generalization performance.
We have theoretically proved the convergence of AdaL and demonstrated its
effectiveness on several benchmarks.
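The abstract describes the core idea concretely: amplify the gradient early in training and shrink it later, inside an Adam-style update. A minimal sketch of that idea, assuming a hypothetical linearly decaying scale factor (the paper's actual transformation is not reproduced here; `transform`, `gamma`, and the schedule are illustrative only):

```python
import numpy as np

def transform(g, t, T, gamma=2.0):
    """Time-dependent gradient transform (illustrative, not AdaL's exact rule):
    the scale decays linearly from gamma (> 1, amplify) to 0.5 (< 1, shrink)."""
    scale = gamma * (1.0 - t / T) + 0.5 * (t / T)
    return scale * g

def adal_like_step(theta, g, m, v, t, T, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update applied to the transformed gradient."""
    gt = transform(g, t, T)
    m = b1 * m + (1 - b1) * gt           # first moment of transformed gradient
    v = b2 * v + (1 - b2) * gt * gt      # second moment of transformed gradient
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On a toy quadratic, the early amplified gradients move quickly, while the shrunken late-stage gradients damp oscillation near the optimum.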
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates normalized SGD with clipping (NSGDC) and its variance reduction variant (NSGDC-VR).
We present significant improvements in the theoretical results for both algorithms.
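The normalization and clipping operations named in the title have standard textbook forms, sketched below; these generic one-step updates are not necessarily the exact NSGDC/NSGDC-VR algorithms of the paper:

```python
import numpy as np

def normalized_step(theta, g, lr=0.1):
    """Normalized SGD: step along g / ||g||, so the step length equals lr
    regardless of the raw gradient magnitude (robust to heavy-tailed noise)."""
    n = np.linalg.norm(g)
    return theta if n == 0.0 else theta - lr * g / n

def clipped_step(theta, g, lr=0.1, c=1.0):
    """Clipped SGD: rescale g only when ||g|| exceeds the threshold c,
    capping the effective gradient norm at c."""
    n = np.linalg.norm(g)
    scale = min(1.0, c / n) if n > 0.0 else 0.0
    return theta - lr * scale * g
```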
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- Directional Smoothness and Gradient Methods: Convergence and Adaptivity [16.779513676120096]
We develop new sub-optimality bounds for gradient descent that depend on the conditioning of the objective along the path of optimization.
Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective.
We prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness.
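The Polyak step size mentioned here has a standard closed form, $\eta_t = (f(x_t) - f^*) / \|\nabla f(x_t)\|^2$; a minimal sketch, assuming the optimal value $f^*$ is known:

```python
import numpy as np

def polyak_gd(f, grad, x0, f_star=0.0, steps=100):
    """Gradient descent with the Polyak step size
    eta_t = (f(x_t) - f_star) / ||grad f(x_t)||^2.
    Requires knowing f_star, but no smoothness constant."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        gnorm2 = float(np.dot(g, g))
        if gnorm2 == 0.0:
            break  # at a stationary point
        x = x - (f(x) - f_star) / gnorm2 * g
    return x
```

On the quadratic $f(x) = \|x\|^2$ the step size evaluates to a constant $1/4$, giving geometric convergence without ever specifying a learning rate.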
arXiv Detail & Related papers (2024-03-06T22:24:05Z)
- Signal Processing Meets SGD: From Momentum to Filter [6.751292200515353]
In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization.
We propose a novel optimization method designed to accelerate SGD's convergence without sacrificing generalization.
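The signal-processing view alluded to in the title treats the momentum buffer as a first-order low-pass filter over the gradient stream; a sketch of that interpretation (the paper's specific filter design is not reproduced here):

```python
def momentum_as_filter(grads, beta=0.9):
    """The momentum buffer m_t = beta * m_{t-1} + (1 - beta) * g_t is a
    first-order exponential (IIR low-pass) filter over the gradient sequence:
    high-frequency gradient noise is attenuated, while the slowly varying
    descent direction passes through."""
    m, out = 0.0, []
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out
```

A constant gradient passes through almost unchanged, while a rapidly alternating (high-frequency) gradient is strongly attenuated.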
arXiv Detail & Related papers (2023-11-06T01:41:46Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaption method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly in $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a method with a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda has a faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
- Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first result demonstrating that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for nonconvex min-max problems.
Our experiments show empirically that adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
- On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization [80.03647903934723]
We prove the convergence of adaptive gradient methods to first-order stationary points in nonconvex stochastic optimization.
Our analyses shed light on a better understanding of adaptive gradient methods in the nonconvex setting.
arXiv Detail & Related papers (2018-08-16T20:25:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.