The AdEMAMix Optimizer: Better, Faster, Older
- URL: http://arxiv.org/abs/2409.03137v2
- Date: Fri, 27 Sep 2024 18:31:02 GMT
- Title: The AdEMAMix Optimizer: Better, Faster, Older
- Authors: Matteo Pagliardini, Pierre Ablin, David Grangier
- Abstract summary: This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal.
We propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients.
Our experiments on language modeling and image classification show -- quite surprisingly -- that gradients can stay relevant for tens of thousands of steps.
- Score: 24.470432924661324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which decays exponentially the present contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: e.g., a $1.3$B parameter AdEMAMix LLM trained on $101$B tokens performs comparably to an AdamW model trained on $197$B tokens ($+95\%$). Moreover, our method significantly slows-down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
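To make the abstract's description concrete, the following is a minimal sketch of an Adam-style step extended with a second, slow EMA of gradients, in the spirit of AdEMAMix. The function name `ademamix_step`, the state layout, and the particular values of `beta3` and `alpha` are illustrative assumptions; hyperparameter schedules and other details of the paper's reference implementation are omitted.

```python
import torch

@torch.no_grad()
def ademamix_step(p, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """Sketch of one AdEMAMix-style update for a single parameter tensor `p`.

    `state` is assumed to hold zero-initialized tensors "m1", "m2", "v"
    (same shape as `p`) and an integer "step".
    """
    beta1, beta2, beta3 = betas
    g = p.grad
    state["step"] += 1
    t = state["step"]

    # Fast EMA (Adam's first moment) and second moment, both bias-corrected.
    state["m1"].mul_(beta1).add_(g, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m1_hat = state["m1"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Slow EMA: a beta3 close to 1 keeps old gradients relevant for many steps.
    state["m2"].mul_(beta3).add_(g, alpha=1 - beta3)

    # Mix the two EMAs in the numerator of an Adam-style update.
    update = (m1_hat + alpha * state["m2"]) / (v_hat.sqrt() + eps)
    if weight_decay > 0:
        update = update + weight_decay * p  # decoupled, AdamW-style decay
    p.add_(update, alpha=-lr)
```

The point of the second EMA is that a decay rate very close to 1 gives older gradients a non-negligible weight without blunting the responsiveness of the fast EMA, which is exactly the trade-off a single EMA cannot resolve.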
Related papers
- EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes [15.18685417164164]
We introduce the Bias-Corrected Exponential Moving Average (BEMA) and show that it leads to significantly improved convergence rates and final performance over both EMA and vanilla training. BEMA is a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
arXiv Detail & Related papers (2025-07-31T21:49:20Z) - Can Gradient Descent Simulate Prompting? [56.60154660021178]
We ask whether gradient updates can reproduce the effects of conditioning on new information. Gradient descent training recovers some (and occasionally all) of prompted model performance. Results suggest new avenues for long-context modeling.
arXiv Detail & Related papers (2025-06-26T04:06:20Z) - Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants [5.08749017242817]
We show that AdEMAMix most closely resembles accelerated versions of gradient descent.
We introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance across both large and small batch-size settings.
arXiv Detail & Related papers (2025-02-04T15:55:35Z) - Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits [11.801688624472009]
We present a systematic study of the Exponential Moving Average (EMA) of weights.
We show that EMA solutions differ from last-iterate solutions.
We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
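The weight-EMA plug-in mentioned above can be sketched in a few lines. This is a generic shadow-weights implementation of the standard formulation, not necessarily the exact scheme studied in that paper; the class name `WeightEMA` and the decay value 0.999 are placeholders.

```python
import copy
import torch

class WeightEMA:
    """Generic EMA-of-weights plug-in: keep a shadow copy of the parameters,
    update it after every optimizer step, and evaluate with the shadow copy."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```

In use, `update(model)` would be called after each optimizer step and `shadow` used for evaluation or checkpointing; non-parameter state such as batch-norm buffers is left out of this sketch.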
arXiv Detail & Related papers (2024-11-27T19:14:27Z) - Classifier-guided Gradient Modulation for Enhanced Multimodal Learning [50.7008456698935]
Classifier-Guided Gradient Modulation (CGGM) is a novel method to balance multimodal learning with gradients.
We conduct extensive experiments on four multimodal datasets: UPMC-Food 101, CMU-MOSI, IEMOCAP and BraTS.
CGGM outperforms all the baselines and other state-of-the-art methods consistently.
arXiv Detail & Related papers (2024-11-03T02:38:43Z) - Switch EMA: A Free Lunch for Better Flatness and Sharpness [58.55452862747021]
This work unveils the full potential of EMA with a single line of modification, i.e., switching the averaged parameters back into the original model after each epoch, dubbed Switch EMA (SEMA).
From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs to reach generalization optima that better trade-off between flatness and sharpness.
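The single-line modification described above can be sketched as follows, assuming a shadow-EMA object like the `WeightEMA` sketch earlier; the once-per-epoch granularity follows the summary, everything else is an assumption.

```python
import torch

@torch.no_grad()
def switch_ema(model, ema):
    # SEMA-style switch: copy the EMA weights back into the trained model
    # (typically once per epoch) so training continues from the averaged point.
    for p, s in zip(model.parameters(), ema.shadow.parameters()):
        p.copy_(s)
```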
arXiv Detail & Related papers (2024-02-14T15:28:42Z) - ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free), gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - HyperMAML: Few-Shot Adaptation of Deep Models with Hypernetworks [0.0]
Few-Shot learning aims to train models which can easily adapt to previously unseen tasks.
Model-Agnostic Meta-Learning (MAML) is one of the most popular Few-Shot learning approaches.
In this paper, we propose HyperMAML, where the training of the update procedure is also part of the model.
arXiv Detail & Related papers (2022-05-31T12:31:21Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs strongly on vision tasks and consistently achieves state-of-the-art results on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Staircase Sign Method for Boosting Adversarial Attacks [123.19227129979943]
Crafting adversarial examples for transfer-based attacks is challenging and remains a research hotspot.
We propose a novel Staircase Sign Method (S$^2$M) to alleviate this issue, thus boosting transfer-based attacks.
Our method can be generally integrated into any transfer-based attacks, and the computational overhead is negligible.
arXiv Detail & Related papers (2021-04-20T02:31:55Z) - In-Loop Meta-Learning with Gradient-Alignment Reward [34.1954698584925]
We present a cheap-to-compute and memory-saving reward, the gradient-alignment reward (GAR), that can guide the optimization.
First, we present the application of GAR to choosing the data distribution as a mixture of multiple dataset splits in a small-scale setting.
Second, we show that it can successfully guide learning augmentation strategies competitive with state-of-the-art augmentation strategies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2021-02-05T16:27:08Z)