Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization
- URL: http://arxiv.org/abs/2106.11514v1
- Date: Tue, 22 Jun 2021 03:13:23 GMT
- Title: Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization
- Authors: Yizhou Wang, Yue Kang, Can Qin, Yi Xu, Huan Wang, Yulun Zhang, Yun Fu
- Abstract summary: \textsc{AdaMomentum} exhibits comparable performance to \textsc{SGD} on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
- Score: 89.66571637204012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive gradient methods, such as \textsc{Adam}, have achieved tremendous
success in machine learning. Scaling gradients by square roots of the running
averages of squared past gradients, such methods are able to attain rapid
training of modern deep neural networks. Nevertheless, they are observed to
generalize worse than stochastic gradient descent (\textsc{SGD}) and tend to be
trapped in local minima at an early stage during training. Intriguingly, we
discover that substituting the gradient in the preconditioner term with the
momentumized version in \textsc{Adam} can resolve these issues. The intuition
is that the gradient with momentum carries more accurate directional
information, so its second-moment estimate is a better choice for scaling than
that of the raw gradient. We therefore propose \textsc{AdaMomentum}, a new
optimizer that trains faster while generalizing better. We further
develop a theory to back up the improvement in optimization and generalization
and provide convergence guarantee under both convex and nonconvex settings.
Extensive experiments on various models and tasks demonstrate that
\textsc{AdaMomentum} exhibits comparable performance to \textsc{SGD} on vision
tasks, and achieves state-of-the-art results consistently on other tasks
including language processing.
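The abstract's core change to \textsc{Adam} is to build the second-moment (preconditioner) estimate from the momentumized gradient rather than the raw gradient. Below is a minimal NumPy sketch of that idea; the bias correction and epsilon placement follow Adam's usual conventions and are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def adamomentum_step(theta, grad, m, v, t, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One parameter update. m, v are running first/second moments; t is 1-based."""
    m = beta1 * m + (1 - beta1) * grad      # momentumized gradient, as in Adam
    v = beta2 * v + (1 - beta2) * m * m     # second moment of the *momentum*, not the raw grad
    m_hat = m / (1 - beta1 ** t)            # bias correction (Adam convention, assumed)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

For comparison, standard Adam would use `grad * grad` in the `v` update; the single changed line is the substitution the abstract describes.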
Related papers
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability for local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning [13.937644559223548]
How to train deep neural networks (DNNs) to generalize well is a central concern in deep learning.
We propose an effective method to improve the model generalization by penalizing the gradient norm of loss function during optimization.
arXiv Detail & Related papers (2022-02-08T02:03:45Z)
- Step-size Adaptation Using Exponentiated Gradient Updates [21.162404996362948]
We show that augmenting a given optimization method with adaptive tuning of the step-size greatly improves the performance.
We maintain a global step-size scale for the update as well as a gain factor for each coordinate.
We show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule.
arXiv Detail & Related papers (2022-01-31T23:17:08Z)
- On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z)
- Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a method with a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda has a faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z)
- Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, it is the first to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.