AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights
- URL: http://arxiv.org/abs/2006.08217v3
- Date: Mon, 18 Jan 2021 14:36:15 GMT
- Title: AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights
- Authors: Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun,
Gyuwan Kim, Youngjung Uh, Jung-Woo Ha
- Abstract summary: Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performances.
- Score: 53.8489656709356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalization techniques are a boon for modern deep learning. They let
weights converge more quickly, often with better generalization performance. It
has been argued that the normalization-induced scale invariance among the
weights provides an advantageous ground for gradient descent (GD) optimizers:
the effective step sizes are automatically reduced over time, stabilizing the
overall training procedure. It is often overlooked, however, that the
additional introduction of momentum in GD optimizers results in a far more
rapid reduction in effective step sizes for scale-invariant weights, a
phenomenon that has not yet been studied and may have caused unwanted side
effects in the current practice. This is a crucial issue because arguably the
vast majority of modern deep neural networks consist of (1) momentum-based GD
(e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify
that the widely-adopted combination of the two ingredients leads to the
premature decay of effective step sizes and sub-optimal model performances. We
propose a simple and effective remedy, SGDP and AdamP: get rid of the radial
component, or the norm-increasing direction, at each optimizer step. Because of
the scale invariance, this modification only alters the effective step sizes
without changing the effective update directions, thus enjoying the original
convergence properties of GD optimizers. Given the ubiquity of momentum GD and
scale invariance in machine learning, we have evaluated our methods against the
baselines on 13 benchmarks. They range from vision tasks like classification
(e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to
language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks.
We verify that our solution brings about uniform gains in those benchmarks.
Source code is available at https://github.com/clovaai/AdamP.
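The remedy described above, dropping the radial (norm-increasing) component of each update, amounts to projecting the momentum step onto the plane orthogonal to the current weight vector; because scale-invariant weights leave the loss unchanged along the radial direction, this only alters the effective step size. Below is a minimal PyTorch-style sketch of that idea, not the reference implementation from the linked repository: the names project_out_radial and sgdp_step are illustrative, the projection is applied per tensor, and the cosine-similarity test used in the released code to decide which parameters are scale-invariant is omitted.

import torch

def project_out_radial(update, weight, eps=1e-8):
    # Remove the component of the update parallel to the weight
    # (the norm-increasing, radial direction).
    w_unit = weight / (weight.norm() + eps)
    return update - (update * w_unit).sum() * w_unit

def sgdp_step(weight, grad, momentum_buf, lr=0.1, momentum=0.9):
    # One SGD-with-momentum step whose radial component is projected out:
    # a simplified, SGDP-like update for a single weight tensor.
    with torch.no_grad():
        momentum_buf.mul_(momentum).add_(grad)           # heavy-ball momentum
        step = project_out_radial(momentum_buf, weight)  # drop the radial part
        weight.add_(step, alpha=-lr)                     # tangential direction kept
    return momentum_buf

# Minimal usage with a stand-in loss (in practice, scale invariance comes from
# weights that feed directly into normalization layers such as BatchNorm):
w = torch.nn.Parameter(torch.randn(64, 32))
buf = torch.zeros_like(w)
loss = (w.sum() - 1.0) ** 2
loss.backward()
sgdp_step(w, w.grad, buf, lr=0.01)

For truly scale-invariant weights the gradient is already orthogonal to the weight vector, so the projection leaves the effective update direction on the unit sphere unchanged; it only curbs the growth of the weight norm and, with it, the premature shrinkage of the effective step size (the learning rate divided by the squared weight norm).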
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization [3.1153758106426603]
We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $\alpha$, and adapts it at each epoch according to whether the gradient changes sign or not.
We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
arXiv Detail & Related papers (2023-01-24T16:57:00Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.