Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale
- URL: http://arxiv.org/abs/2210.11693v1
- Date: Fri, 21 Oct 2022 02:37:58 GMT
- Title: Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale
- Authors: Ran Tian, Ankur P. Parikh
- Abstract summary: Amos is a stochastic gradient-based optimizer for training deep neural networks.
It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay.
- Score: 16.97880876259831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Amos, a stochastic gradient-based optimizer designed for training
deep neural networks. It can be viewed as an Adam optimizer with theoretically
supported, adaptive learning-rate decay and weight decay. A key insight behind
Amos is that it leverages model-specific information to determine the initial
learning-rate and decaying schedules. When used for pre-training BERT variants
and T5, Amos consistently converges faster than the state-of-the-art settings
of AdamW, achieving better validation loss within <=70% training steps and
time, while requiring <=51% memory for slot variables. Our code is open-sourced
at: https://github.com/google-research/jestimator
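The abstract does not spell out the update rule, but the core idea can be illustrated: an Adam-style normalized gradient step whose initial learning rate is derived from a model-specific scale for each variable, with both the learning rate and the weight decay decaying adaptively as training progresses. The NumPy sketch below illustrates that idea only; the function, the hyperparameters xi and c, and the particular decay form are assumptions of this sketch, not the published Amos rules (see the paper and the jestimator repository for those).

```python
import numpy as np

def amos_like_update(theta, g, state, eta, xi=1e-3, beta2=0.999, c=0.25, eps=1e-12):
    """One Adam-style step with scale-driven learning rate and weight decay.

    NOT the actual Amos update rules -- an illustration of the abstract only:
      * eta is the "model-oriented scale" (the expected magnitude of the trained
        weights) and sets the initial learning rate;
      * the learning rate and the weight decay both shrink adaptively, driven
        here by an accumulated decay factor b (an assumption of this sketch).
    """
    v, b = state["v"], state["b"]           # second-moment estimate, decay accumulator
    v = beta2 * v + (1.0 - beta2) * g * g   # Adam-style running average of g^2
    g_hat = g / (np.sqrt(v) + eps)          # normalized gradient, roughly unit scale
    decay = 1.0 / (1.0 + c * b)             # decays from 1 toward 0 as b grows
    theta = theta - xi * eta * decay * g_hat - (xi ** 2) * decay * theta
    b = b + np.mean(np.abs(g_hat))          # track how much the variable has moved
    state["v"], state["b"] = v, b
    return theta, state

# Toy usage on a least-squares problem; eta is set from the variable's shape.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 16)), rng.normal(size=256)
theta = np.zeros(16)
state = {"v": np.zeros(16), "b": 0.0}
eta = 1.0 / np.sqrt(16)                     # e.g. expected weight scale ~ 1/sqrt(fan_in)
for _ in range(200):
    g = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta, state = amos_like_update(theta, g, state, eta)
```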
Related papers
- A second-order-like optimizer with adaptive gradient scaling for deep learning [13.174512123890016]
INNAprop is an optimization algorithm that combines the INNA method with RMSprop-style adaptive gradient scaling.
On image classification (CIFAR-10, ImageNet) and language modeling (GPT-2), INNAprop consistently matches or outperforms AdamW both in training speed and accuracy.
arXiv Detail & Related papers (2024-10-08T09:58:38Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable by our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- The Entropy Enigma: Success and Failure of Entropy Minimization [30.083332640328642]
Entropy minimization (EM) is frequently used to increase the accuracy of classification models when they are faced with new data at test time (a generic sketch of this technique appears after this list).
We analyze why EM works when adapting a model for a few steps and why it eventually fails after adapting for many steps.
We present a method for solving a practical problem: estimating a model's accuracy on a given arbitrary dataset without having access to its labels.
arXiv Detail & Related papers (2024-05-08T12:26:15Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Weight Prediction Boosts the Convergence of AdamW [3.7485728774744556]
We introduce weight prediction into AdamW to boost its convergence when training deep neural network (DNN) models.
In particular, ahead of each mini-batch training step, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights (see the sketch after this list).
arXiv Detail & Related papers (2023-02-01T02:58:29Z)
- Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization [3.1153758106426603]
We propose ActiveLR, an optimization meta-algorithm that localizes the learning rate, $\alpha$, and adapts it at each epoch according to whether the gradient at that epoch changes sign or not (see the sketch after this list).
We implement the Active version (ours) of widely used and recently published optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
arXiv Detail & Related papers (2023-01-24T16:57:00Z)
- Boosted Dynamic Neural Networks [53.559833501288146]
A typical EDNN has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data.
Treating training and testing inputs differently in the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z)
- MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption [69.76837484008033]
An unresolved problem in deep learning is the ability of neural networks to cope with domain shifts during test time.
We combine meta-learning, self-supervision and test-time training to learn to adapt to unseen test distributions.
Our approach significantly improves the state-of-the-art results on the CIFAR-10-Corrupted image classification benchmark.
arXiv Detail & Related papers (2021-03-30T09:33:38Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
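For the test-time entropy minimization referenced in the entry on The Entropy Enigma above, the following is a small NumPy sketch of the generic technique only, not that paper's analysis or its label-free accuracy estimator: a linear softmax classifier is adapted on unlabeled test inputs by gradient descent on the mean prediction entropy.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_min_step(W, X, lr=0.1):
    """One step of test-time entropy minimization on an unlabeled batch X.

    For p = softmax(z), dH/dz_j = -p_j * (log p_j + H); the chain rule through
    z = X W^T gives the gradient of the mean entropy w.r.t. W. No labels used.
    """
    P = softmax(X @ W.T)                          # (n, classes) predicted probabilities
    logP = np.log(P + 1e-12)
    H = -(P * logP).sum(axis=1, keepdims=True)    # per-example entropy
    dz = -P * (logP + H)                          # gradient of entropy w.r.t. logits
    dW = dz.T @ X / len(X)                        # gradient of the batch-mean entropy
    return W - lr * dW

def mean_entropy(W, X):
    P = softmax(X @ W.T)
    return float(-(P * np.log(P + 1e-12)).sum(axis=1).mean())

# Toy usage: a random "test" batch; a few steps make the predictions more confident.
rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(3, 8))                 # 3 classes, 8 features
X = rng.normal(size=(64, 8))
W0 = W.copy()
for _ in range(20):
    W = entropy_min_step(W, X)
print(mean_entropy(W0, X), "->", mean_entropy(W, X))   # entropy goes down
```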
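For the entry on Weight Prediction Boosts the Convergence of AdamW above, the following is a rough NumPy sketch of one reading of the summary: before each mini-batch step, the future weights are predicted with the AdamW update rule from the current moment estimates, the gradient is evaluated at those predicted weights, and the ordinary AdamW update is then applied. The toy least-squares objective and all hyperparameter values are illustrative; the paper's exact prediction scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

def grad(w, idx):                      # gradient of mean squared error on a mini-batch
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def adamw_direction(m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    m_hat = m / (1.0 - beta1 ** t)     # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)     # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps)

w = np.zeros(10)
m, v = np.zeros(10), np.zeros(10)
lr, wd, beta1, beta2 = 1e-2, 1e-2, 0.9, 0.999

for t in range(1, 501):
    idx = rng.integers(0, 256, size=32)
    # 1) Predict the future weights with the AdamW rule and the *current* moments.
    w_pred = w - lr * (adamw_direction(m, v, t) + wd * w)
    # 2) Evaluate the gradient at the predicted weights (forward/backward there).
    g = grad(w_pred, idx)
    # 3) Apply the ordinary AdamW update to the current weights with that gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - lr * (adamw_direction(m, v, t) + wd * w)

print("distance to w_true:", np.linalg.norm(w - w_true))
```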
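For the entry on ActiveLR (Read the Signs) above, the following is a minimal NumPy sketch of the general idea of sign-based, per-parameter learning-rate adaptation: each parameter's learning rate grows while its epoch-level gradient keeps its sign and shrinks when the sign flips. The plain-SGD base, the grow/shrink factors, and the learning-rate cap are assumptions of this sketch, not the ActiveLR rules from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)
w = np.zeros(5)

lr = np.full(5, 1e-2)                  # one learning rate per parameter
prev_epoch_grad = np.zeros(5)

for epoch in range(50):
    epoch_grad = np.zeros(5)
    for i in range(0, 100, 10):        # mini-batches of size 10
        Ab, bb = A[i:i + 10], b[i:i + 10]
        g = 2.0 * Ab.T @ (Ab @ w - bb) / 10
        w -= lr * g                    # per-parameter SGD step
        epoch_grad += g
    if epoch > 0:
        flipped = np.sign(epoch_grad) != np.sign(prev_epoch_grad)
        # Shrink on a sign flip, otherwise grow (capped for this toy problem).
        lr = np.where(flipped, lr * 0.5, np.minimum(lr * 1.1, 0.1))
    prev_epoch_grad = epoch_grad

print("final loss:", float(np.mean((A @ w - b) ** 2)))
```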
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.