EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
- URL: http://arxiv.org/abs/2508.00180v1
- Date: Thu, 31 Jul 2025 21:49:20 GMT
- Title: EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
- Authors: Adam Block, Cyril Zhang
- Abstract summary: We propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA. We show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training. BEMA is a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
- Score: 15.18685417164164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an Exponential moving average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on Language Models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training in a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
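For context on the mechanics, the sketch below maintains an EMA of model parameters alongside simulated training iterates and applies the standard 1/(1 - beta^t) correction for a zero-initialized average. It is only a minimal illustration of the "variance reduction plus bias correction" idea, with placeholder names; BEMA's actual correction targets the lag behind the current iterate, and its exact update should be taken from the paper.

```python
import numpy as np

def ema_update(ema, params, beta=0.999):
    """One EMA step over a dict of parameter arrays: ema <- beta*ema + (1-beta)*params."""
    return {k: beta * ema[k] + (1.0 - beta) * params[k] for k in params}

def debias(ema, beta, step):
    """Standard 1/(1 - beta**step) correction for a zero-initialized EMA.

    Note: this removes the bias toward the zero initialization; the correction
    BEMA actually uses (aimed at the lag behind the current iterate) differs,
    so treat this only as an illustration of the general idea.
    """
    return {k: v / (1.0 - beta ** step) for k, v in ema.items()}

# Toy usage: noisy "iterates" hovering around a target parameter vector.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
ema, beta = {"w": np.zeros(2)}, 0.99   # zero-initialized so the debiasing applies
for t in range(1, 201):
    iterate = {"w": target + 0.1 * rng.standard_normal(2)}  # stand-in for one training step
    ema = ema_update(ema, iterate, beta)
print(debias(ema, beta, 200)["w"])     # close to the target, with reduced noise
```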
Related papers
- Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression [55.2480439325792]
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models. In this paper, we establish the risk bound of online SGD with EMA for high-dimensional linear regression.
arXiv Detail & Related papers (2025-02-19T21:55:11Z)
- Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits [11.801688624472009]
We present a systematic study of the Exponential Moving Average (EMA) of weights. We show that EMA solutions differ from last-iterate solutions. We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
arXiv Detail & Related papers (2024-11-27T19:14:27Z)
- The AdEMAMix Optimizer: Better, Faster, Older [24.470432924661324]
This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal.
We propose AdEMAMix, a simple modification of the Adam optimizer that uses a mixture of two EMAs to better take advantage of past gradients.
Our experiments on language modeling and image classification show -- quite surprisingly -- that gradients can stay relevant for tens of thousands of steps.
arXiv Detail & Related papers (2024-09-05T00:13:16Z)
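As a rough sketch of the mixture idea described in the AdEMAMix entry just above: an Adam-style step that keeps a fast and a slow EMA of gradients and combines them in the numerator. The hyperparameter names (beta1, beta2, beta3, alpha) and the exact combination are my assumptions from the abstract, not the published recipe.

```python
import numpy as np

def two_ema_adam_step(theta, grad, state, lr=1e-3,
                      beta1=0.9, beta2=0.999, beta3=0.9999,
                      alpha=5.0, eps=1e-8):
    """Adam-like update mixing a fast (beta1) and a slow (beta3) gradient EMA.

    Assumed form for illustration only; see the AdEMAMix paper for the real update.
    """
    state["t"] += 1
    t = state["t"]
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad     # fast EMA of gradients
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad     # slow EMA: long gradient memory
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2    # second-moment EMA
    m1_hat = state["m1"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return theta - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)

# Gradients that flip sign late in training: the fast EMA flips with them,
# while the slow EMA still remembers the earlier direction.
theta = np.zeros(3)
state = {"t": 0, "m1": np.zeros(3), "m2": np.zeros(3), "v": np.zeros(3)}
for step in range(200):
    grad = np.ones(3) if step < 150 else -np.ones(3)
    theta = two_ema_adam_step(theta, grad, state)
print(state["m1"][0], state["m2"][0])  # m1 ~ -1 (recent direction), m2 still slightly positive (old direction)
```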
- Switch EMA: A Free Lunch for Better Flatness and Sharpness [58.55452862747021]
This work unveils the full potential of EMA with a single line of modification, i.e., switching the parameters to the original model after each epoch, dubbed Switch EMA (SEMA).
From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs reach generalization optima that better trade off flatness and sharpness.
arXiv Detail & Related papers (2024-02-14T15:28:42Z)
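The single-line modification described in the Switch EMA entry above amounts to copying the EMA weights back into the model at each epoch boundary and continuing training from there. Below is a minimal sketch of that control flow, with a placeholder SGD step standing in for real training; the helper names are not the authors' code.

```python
import numpy as np

def sgd_step(params, rng, lr=0.1):
    """Placeholder for one noisy training step pulling the weights toward zero."""
    return {k: v - lr * (v + 0.05 * rng.standard_normal(v.shape)) for k, v in params.items()}

def sema_training(params, epochs=5, steps_per_epoch=100, beta=0.99, seed=0):
    rng = np.random.default_rng(seed)
    ema = {k: v.copy() for k, v in params.items()}        # EMA starts at the current weights
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            params = sgd_step(params, rng)
            ema = {k: beta * ema[k] + (1 - beta) * params[k] for k in params}
        params = {k: v.copy() for k, v in ema.items()}     # the "switch": resume training from the EMA weights
    return params

print(sema_training({"w": np.ones(4)})["w"])  # near zero, smoothed by the per-epoch switches
```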
- Exponential weight averaging as damped harmonic motion [13.305570580429489]
The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of quantities in deep learning optimization.
In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring.
We then leverage this physical analogy to analyze the effectiveness of EMA, and propose an improved training algorithm, which we call BELAY.
arXiv Detail & Related papers (2023-10-20T23:15:46Z)
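Concretely, the connection in the entry above starts from rewriting the EMA recursion as a pull toward the model weights, m <- m + (1 - beta)(theta - m): a restoring force proportional to the separation, i.e., the overdamped limit of a zero-length spring. The snippet below only checks that this rewriting is identical to the usual EMA update; the full damped-harmonic correspondence and the proposed BELAY algorithm are developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.98
trajectory = np.cumsum(rng.standard_normal(500))   # a wandering "model weight" over training
m_ema = m_spring = 0.0
for theta in trajectory:
    m_ema = beta * m_ema + (1 - beta) * theta              # standard EMA update
    m_spring = m_spring + (1 - beta) * (theta - m_spring)  # "spring" form: pulled toward theta
assert np.isclose(m_ema, m_spring)  # the two forms are algebraically identical
print(m_ema)
```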
- How to Scale Your EMA [20.94711634514331]
We provide a scaling rule for optimization in the presence of a model EMA.
We show the rule's validity where the model EMA contributes to the optimization of the target model.
For Self-Supervised Learning, we enable training of BYOL up to batch size 24,576 without sacrificing performance.
arXiv Detail & Related papers (2023-07-25T20:33:48Z)
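The abstract above does not state the rule itself; the form usually quoted for this line of work is that scaling the batch size by kappa (with the learning rate scaled accordingly) calls for raising the EMA momentum to the power kappa, so the amount of averaging per unit of data is preserved. The helper below encodes that reading and should be checked against the paper before use.

```python
def scale_ema_momentum(rho: float, kappa: float) -> float:
    """Assumed form of the EMA scaling rule: momentum rho at the base batch size
    becomes rho**kappa when the batch size is scaled by kappa (my reading, not a
    statement of the paper's exact result or its conditions)."""
    return rho ** kappa

# Example: momentum 0.9999 at the base batch size, training at 16x the batch size.
print(scale_ema_momentum(0.9999, 16))  # ~0.9984
```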
- Learning with Multiclass AUC: Theory and Algorithms [141.63211412386283]
Area under the ROC curve (AUC) is a well-known ranking metric for problems such as imbalanced learning and recommender systems.
In this paper, we start an early trial to consider the problem of learning multiclass scoring functions via optimizing multiclass AUC metrics.
arXiv Detail & Related papers (2021-07-28T05:18:10Z)
- No MCMC for me: Amortized sampling for fast and stable training of energy-based models [62.1234885852552]
Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty.
We present a simple method for training EBMs at scale using an entropy-regularized generator to amortize the MCMC sampling.
Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and more stable training.
arXiv Detail & Related papers (2020-10-08T19:17:20Z)
- Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems [84.3996727203154]
We show that a popular choice of contrastive loss is equivalent to reducing the exposure bias via inverse propensity weighting.
We further improve upon CLRec and propose Multi-CLRec for accurate multi-intention-aware bias reduction.
Our methods have been successfully deployed in Taobao, where at least four months of online A/B tests and offline analyses demonstrate their substantial improvements.
arXiv Detail & Related papers (2020-05-20T08:15:23Z)
- Training Deep Energy-Based Models with f-Divergence Minimization [113.97274898282343]
Deep energy-based models (EBMs) are very flexible in distribution parametrization but computationally challenging.
We propose a general variational framework termed f-EBM to train EBMs using any desired f-divergence.
Experimental results demonstrate the superiority of f-EBM over contrastive divergence, as well as the benefits of training EBMs using f-divergences other than KL.
arXiv Detail & Related papers (2020-03-06T23:11:13Z)