Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale
- URL: http://arxiv.org/abs/2210.11693v1
- Date: Fri, 21 Oct 2022 02:37:58 GMT
- Title: Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale
- Authors: Ran Tian, Ankur P. Parikh
- Abstract summary: Amos is a stochastic gradient-based optimizer for training deep neural networks.
It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay.
- Score: 16.97880876259831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Amos, a stochastic gradient-based optimizer designed for training
deep neural networks. It can be viewed as an Adam optimizer with theoretically
supported, adaptive learning-rate decay and weight decay. A key insight behind
Amos is that it leverages model-specific information to determine the initial
learning-rate and decaying schedules. When used for pre-training BERT variants
and T5, Amos consistently converges faster than the state-of-the-art settings
of AdamW, achieving better validation loss within <=70% training steps and
time, while requiring <=51% memory for slot variables. Our code is open-sourced
at: https://github.com/google-research/jestimator
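The abstract does not spell out the update rule, but the core idea can be illustrated: an Adam-style normalized gradient step whose initial learning rate is derived from a model-specific scale for each variable, with both the learning rate and the weight decay decaying adaptively as training progresses. The NumPy sketch below illustrates that idea only; the function, the hyperparameters xi and c, and the particular decay form are assumptions of this sketch, not the published Amos rules (see the paper and the jestimator repository for those).

```python
import numpy as np

def amos_like_update(theta, g, state, eta, xi=1e-3, beta2=0.999, c=0.25, eps=1e-12):
    """One Adam-style step with scale-driven learning rate and weight decay.

    NOT the actual Amos update rules -- an illustration of the abstract only:
      * eta is the "model-oriented scale" (the expected magnitude of the trained
        weights) and sets the initial learning rate;
      * the learning rate and the weight decay both shrink adaptively, driven
        here by an accumulated decay factor b (an assumption of this sketch).
    """
    v, b = state["v"], state["b"]           # second-moment estimate, decay accumulator
    v = beta2 * v + (1.0 - beta2) * g * g   # Adam-style running average of g^2
    g_hat = g / (np.sqrt(v) + eps)          # normalized gradient, roughly unit scale
    decay = 1.0 / (1.0 + c * b)             # decays from 1 toward 0 as b grows
    theta = theta - xi * eta * decay * g_hat - (xi ** 2) * decay * theta
    b = b + np.mean(np.abs(g_hat))          # track how much the variable has moved
    state["v"], state["b"] = v, b
    return theta, state

# Toy usage on a least-squares problem; eta is set from the variable's shape.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 16)), rng.normal(size=256)
theta = np.zeros(16)
state = {"v": np.zeros(16), "b": 0.0}
eta = 1.0 / np.sqrt(16)                     # e.g. expected weight scale ~ 1/sqrt(fan_in)
for _ in range(200):
    g = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta, state = amos_like_update(theta, g, state, eta)
```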
Related papers
- A second-order-like optimizer with adaptive gradient scaling for deep learning [13.174512123890016]
INNAprop is an optimization algorithm that combines the INNA method with RMSprop-style adaptive gradient scaling.
On image classification (CIFAR-10, ImageNet) and language modeling (GPT-2), INNAprop consistently matches or outperforms AdamW both in training speed and accuracy.
arXiv Detail & Related papers (2024-10-08T09:58:38Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable by our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- The Entropy Enigma: Success and Failure of Entropy Minimization [30.083332640328642]
Entropy minimization (EM) is frequently used to increase the accuracy of classification models when they are faced with new data at test time (a generic sketch of this technique appears after this list).
We analyze why EM works when adapting a model for a few steps and why it eventually fails after adapting for many steps.
We present a method for solving a practical problem: estimating a model's accuracy on a given arbitrary dataset without having access to its labels.
arXiv Detail & Related papers (2024-05-08T12:26:15Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Weight Prediction Boosts the Convergence of AdamW [3.7485728774744556]
We introduce weight prediction into AdamW to boost its convergence when training deep neural network (DNN) models.
In particular, ahead of each mini-batch training step, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights (see the sketch after this list).
arXiv Detail & Related papers (2023-02-01T02:58:29Z)
- Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization [3.1153758106426603]
We propose ActiveLR, an optimization meta-algorithm that localizes the learning rate, $\alpha$, and adapts it at each epoch according to whether the gradient at that epoch changes sign or not (see the sketch after this list).
We implement the Active version (ours) of widely used and recently published optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
arXiv Detail & Related papers (2023-01-24T16:57:00Z)
- Boosted Dynamic Neural Networks [53.559833501288146]
A typical EDNN has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data.
Treating training and testing inputs differently in the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z)
- MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption [69.76837484008033]
An unresolved problem in deep learning is the ability of neural networks to cope with domain shifts during test time.
We combine meta-learning, self-supervision and test-time training to learn to adapt to unseen test distributions.
Our approach significantly improves the state-of-the-art results on the CIFAR-10-Corrupted image classification benchmark.
arXiv Detail & Related papers (2021-03-30T09:33:38Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
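For the test-time entropy minimization referenced in the entry on The Entropy Enigma above, the following is a small NumPy sketch of the generic technique only, not that paper's analysis or its label-free accuracy estimator: a linear softmax classifier is adapted on unlabeled test inputs by gradient descent on the mean prediction entropy.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_min_step(W, X, lr=0.1):
    """One step of test-time entropy minimization on an unlabeled batch X.

    For p = softmax(z), dH/dz_j = -p_j * (log p_j + H); the chain rule through
    z = X W^T gives the gradient of the mean entropy w.r.t. W. No labels used.
    """
    P = softmax(X @ W.T)                          # (n, classes) predicted probabilities
    logP = np.log(P + 1e-12)
    H = -(P * logP).sum(axis=1, keepdims=True)    # per-example entropy
    dz = -P * (logP + H)                          # gradient of entropy w.r.t. logits
    dW = dz.T @ X / len(X)                        # gradient of the batch-mean entropy
    return W - lr * dW

def mean_entropy(W, X):
    P = softmax(X @ W.T)
    return float(-(P * np.log(P + 1e-12)).sum(axis=1).mean())

# Toy usage: a random "test" batch; a few steps make the predictions more confident.
rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(3, 8))                 # 3 classes, 8 features
X = rng.normal(size=(64, 8))
W0 = W.copy()
for _ in range(20):
    W = entropy_min_step(W, X)
print(mean_entropy(W0, X), "->", mean_entropy(W, X))   # entropy goes down
```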
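For the entry on Weight Prediction Boosts the Convergence of AdamW above, the following is a rough NumPy sketch of one reading of the summary: before each mini-batch step, the future weights are predicted with the AdamW update rule from the current moment estimates, the gradient is evaluated at those predicted weights, and the ordinary AdamW update is then applied. The toy least-squares objective and all hyperparameter values are illustrative; the paper's exact prediction scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

def grad(w, idx):                      # gradient of mean squared error on a mini-batch
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def adamw_direction(m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    m_hat = m / (1.0 - beta1 ** t)     # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)     # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps)

w = np.zeros(10)
m, v = np.zeros(10), np.zeros(10)
lr, wd, beta1, beta2 = 1e-2, 1e-2, 0.9, 0.999

for t in range(1, 501):
    idx = rng.integers(0, 256, size=32)
    # 1) Predict the future weights with the AdamW rule and the *current* moments.
    w_pred = w - lr * (adamw_direction(m, v, t) + wd * w)
    # 2) Evaluate the gradient at the predicted weights (forward/backward there).
    g = grad(w_pred, idx)
    # 3) Apply the ordinary AdamW update to the current weights with that gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - lr * (adamw_direction(m, v, t) + wd * w)

print("distance to w_true:", np.linalg.norm(w - w_true))
```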
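For the entry on ActiveLR (Read the Signs) above, the following is a minimal NumPy sketch of the general idea of sign-based, per-parameter learning-rate adaptation: each parameter's learning rate grows while its epoch-level gradient keeps its sign and shrinks when the sign flips. The plain-SGD base, the grow/shrink factors, and the learning-rate cap are assumptions of this sketch, not the ActiveLR rules from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)
w = np.zeros(5)

lr = np.full(5, 1e-2)                  # one learning rate per parameter
prev_epoch_grad = np.zeros(5)

for epoch in range(50):
    epoch_grad = np.zeros(5)
    for i in range(0, 100, 10):        # mini-batches of size 10
        Ab, bb = A[i:i + 10], b[i:i + 10]
        g = 2.0 * Ab.T @ (Ab @ w - bb) / 10
        w -= lr * g                    # per-parameter SGD step
        epoch_grad += g
    if epoch > 0:
        flipped = np.sign(epoch_grad) != np.sign(prev_epoch_grad)
        # Shrink on a sign flip, otherwise grow (capped for this toy problem).
        lr = np.where(flipped, lr * 0.5, np.minimum(lr * 1.1, 0.1))
    prev_epoch_grad = epoch_grad

print("final loss:", float(np.mean((A @ w - b) ** 2)))
```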
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.