Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter
Initialization
- URL: http://arxiv.org/abs/2301.10133v1
- Date: Tue, 24 Jan 2023 16:57:00 GMT
- Title: Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter
Initialization
- Authors: Davood Wadi, Marc Fredette, Sylvain Senecal
- Abstract summary: We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $\alpha$, per parameter and adapts it at each epoch according to whether the gradient changes sign or not.
We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
- Score: 3.1153758106426603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ActiveLR, an optimization meta algorithm that localizes the
learning rate, $\alpha$, per parameter and adapts it at each epoch according to
whether the gradient changes sign or not. This sign-conscious algorithm is
aware of whether, from the previous step to the current one, the update of each
parameter has been too large or too small, and adjusts $\alpha$ accordingly.
We implement the Active version (ours) of widely used and recently published
gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and
AdaBelief. Our experiments on ImageNet, CIFAR-10, WikiText-103, WikiText-2, and
PASCAL VOC using different model architectures, such as ResNet and
Transformers, show an increase in generalizability and training-set fit, and a
decrease in training time for the Active variants of the tested optimizers. The
results also show the robustness of the Active variants of these optimizers to
different values of the initial learning rate. Furthermore, the detrimental
effects of using large mini-batch sizes are mitigated. ActiveLR thus
alleviates the need for hyper-parameter search over two of the most commonly
tuned hyper-parameters, which require heavy time and computational costs to pick.
We encourage AI researchers and practitioners to use the Active variant of
their optimizer of choice for faster training, better generalizability, and
a reduced carbon footprint when training deep neural networks.
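
Since the abstract describes the rule only at a high level, the following is a minimal sketch of a sign-conscious, per-parameter learning-rate update in the spirit of the description above; the growth/shrink factors, the per-step (rather than per-epoch) granularity, and the PyTorch scaffolding are illustrative assumptions, not the authors' ActiveLR implementation.

```python
# Minimal sketch (NOT the authors' ActiveLR code): each parameter keeps a local
# learning rate that grows while its gradient keeps the same sign between steps
# and shrinks when the sign flips. The factors 1.2 / 0.5 are assumptions.
import torch


class SignAwareSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2, grow=1.2, shrink=0.5):
        defaults = dict(lr=lr, grow=grow, shrink=shrink)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "local_lr" not in state:
                    state["local_lr"] = torch.full_like(p, group["lr"])
                    state["prev_grad"] = torch.zeros_like(p)
                agree = (p.grad * state["prev_grad"]) > 0  # same sign as before
                flip = (p.grad * state["prev_grad"]) < 0   # sign changed
                state["local_lr"] = torch.where(
                    agree, state["local_lr"] * group["grow"], state["local_lr"])
                state["local_lr"] = torch.where(
                    flip, state["local_lr"] * group["shrink"], state["local_lr"])
                p.add_(-state["local_lr"] * p.grad)        # plain SGD-style step
                state["prev_grad"] = p.grad.clone()
        return loss
```

It is used like any other PyTorch optimizer, e.g. `opt = SignAwareSGD(model.parameters(), lr=0.1)`; the actual ActiveLR meta algorithm instead wraps momentum-based optimizers such as SGD with momentum, AdamW, RAdam, and AdaBelief and adapts the rate at epoch granularity.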
Related papers
- Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam-style optimizers.
We prove that the optimal learning rate first rises and then falls as the batch size increases.
arXiv Detail & Related papers (2024-05-23T13:52:36Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Hyper-Learning for Gradient-Based Batch Size Adaptation [2.944323057176686]
Scheduling the batch size to increase over training is an effective strategy to control gradient noise when training deep neural networks.
We introduce Arbiter, a new hyper-optimization algorithm that performs batch-size adaptations for learnable schedules.
We demonstrate Arbiter's effectiveness in several illustrative experiments.
arXiv Detail & Related papers (2022-05-17T11:01:14Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no additional computational cost.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp, and Adam in terms of accuracy and also converges faster.
arXiv Detail & Related papers (2021-09-07T20:19:40Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Meta-Learning with Adaptive Hyperparameters [55.182841228303225]
We focus on a complementary factor in the MAML framework: inner-loop optimization (or fast adaptation).
We propose a new weight update rule that greatly enhances the fast adaptation process.
arXiv Detail & Related papers (2020-10-31T08:05:34Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction of a parameter's changes in the past is aligned with the direction of the current gradient (see the sketch after this list).
Our method outperforms previous adaptive learning-rate algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)