Step-size Adaptation Using Exponentiated Gradient Updates
- URL: http://arxiv.org/abs/2202.00145v1
- Date: Mon, 31 Jan 2022 23:17:08 GMT
- Title: Step-size Adaptation Using Exponentiated Gradient Updates
- Authors: Ehsan Amid, Rohan Anil, Christopher Fifty, Manfred K. Warmuth
- Abstract summary: We show that augmenting a given optimizer with an adaptive step-size tuning method greatly improves performance.
We maintain a global step-size scale for the update as well as a gain factor for each coordinate.
We show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule.
- Score: 21.162404996362948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimizers like Adam and AdaGrad have been very successful in training
large-scale neural networks. Yet, the performance of these methods is heavily
dependent on a carefully tuned learning rate schedule. We show that in many
large-scale applications, augmenting a given optimizer with an adaptive step-size
tuning method greatly improves performance. More precisely, we
maintain a global step-size scale for the update as well as a gain factor for
each coordinate. We adjust the global scale based on the alignment of the
average gradient and the current gradient vectors. A similar approach is used
for updating the local gain factors. This type of step-size scale tuning has
been done before with gradient descent updates. In this paper, we update the
step-size scale and the gain variables with exponentiated gradient updates
instead. Experimentally, we show that our approach can achieve compelling
accuracy on standard models without using any specially tuned learning rate
schedule. We also show the effectiveness of our approach for quickly adapting
to distribution shifts in the data during training.
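As a rough illustration of the mechanism described in the abstract, the sketch below maintains a global step-size scale and per-coordinate gains and updates both multiplicatively (exponentiated-gradient style) from gradient-alignment signals. The cosine-alignment term, the sign-agreement rule for the gains, the clipping range, and all hyperparameter names are simplifying assumptions of this sketch, not the paper's exact update rules.

```python
import numpy as np

def eg_adapted_gd(grad_fn, w, steps=1000, base_lr=0.01,
                  meta_lr=0.1, beta=0.9, eps=1e-12):
    """Hypothetical sketch of EG-based step-size adaptation (not the paper's exact rule)."""
    s = 1.0                      # global step-size scale
    p = np.ones_like(w)          # per-coordinate gain factors
    g_avg = np.zeros_like(w)     # running average of past gradients
    for _ in range(steps):
        g = grad_fn(w)
        # Global signal: alignment (cosine) between the average and current gradient.
        align = g_avg @ g / (np.linalg.norm(g_avg) * np.linalg.norm(g) + eps)
        s *= np.exp(meta_lr * align)          # exponentiated-gradient (multiplicative) update
        # Local signal: per-coordinate sign agreement, also applied multiplicatively.
        p *= np.exp(meta_lr * np.sign(g_avg) * np.sign(g))
        p = np.clip(p, 0.1, 10.0)             # keep gains in a bounded range
        g_avg = beta * g_avg + (1 - beta) * g
        w = w - base_lr * s * p * g           # scaled descent step
    return w
```

Because the scale and gains are updated through exponentials of alignment signals, they remain positive by construction, which is the usual motivation for exponentiated-gradient over additive updates on positivity-constrained quantities.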
Related papers
- Neural Gradient Learning and Optimization for Oriented Point Normal
Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability for local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed in the gradients over the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp, and Adam in accuracy and converges faster.
arXiv Detail & Related papers (2021-09-07T20:19:40Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient
Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step toward constructing self-tuning quadratics.
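Since the summary above relies on exact per-sample Hessian-vector products, a short sketch of how such products are typically computed with forward-over-reverse automatic differentiation may help; this illustrates only the HVP primitive (here in JAX), not the paper's gradient-filtering or self-tuning procedure.

```python
import jax
import jax.numpy as jnp

def hvp(loss_fn, params, v):
    """Exact Hessian-vector product: differentiate grad(loss) along direction v."""
    # jax.jvp returns (grad(loss)(params), H @ v); we keep the tangent output.
    return jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]

# Tiny check on a quadratic loss, where the HVP must equal A @ v.
A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
loss = lambda w: 0.5 * w @ A @ w
w0 = jnp.array([1.0, -1.0])
v = jnp.array([0.5, 2.0])
print(hvp(loss, w0, v), A @ v)
```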
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
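The blurb above describes adapting the learning rate to the gradient's variance; below is a minimal sketch of one way such a variance-based gain could be computed from per-replica gradients. The estimator and the exact gain formula are assumptions of this sketch and may differ from AdaScale's actual rule.

```python
import numpy as np

def variance_based_gain(per_replica_grads, eps=1e-12):
    """Hypothetical gain in [1, S]: how much averaging S replica gradients
    improved the signal-to-noise ratio, used to scale up the learning rate."""
    g = np.stack(per_replica_grads)                     # shape (S, d)
    S = g.shape[0]
    mean_sq = float(np.sum(g.mean(axis=0) ** 2))        # ||mean gradient||^2
    var = float(np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1)))  # per-replica variance
    return (var + mean_sq) / (var / S + mean_sq + eps)
```

With identical replica gradients the gain is 1 (the larger batch adds nothing), while with nearly pure noise it approaches S, matching the intuition that large batches help most when gradients are noisy.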
arXiv Detail & Related papers (2020-07-09T23:26:13Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and sub-optimal model performance.
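One remedy for this effect is to remove the radial (along-the-weight) component of the update for scale-invariant weights so their norm does not inflate; the sketch below shows only that projection step, as a simplified illustration rather than AdamP's full optimizer.

```python
import numpy as np

def remove_radial_component(update, w, eps=1e-12):
    """Project the update onto the tangent space of w (drop the radial part).

    For scale-invariant weights this keeps the weight norm from growing,
    which is what causes the premature decay of effective step sizes.
    """
    w_hat = w / (np.linalg.norm(w) + eps)
    return update - (update @ w_hat) * w_hat
```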
arXiv Detail & Related papers (2020-06-15T08:35:15Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.