Exponential weight averaging as damped harmonic motion
- URL: http://arxiv.org/abs/2310.13854v1
- Date: Fri, 20 Oct 2023 23:15:46 GMT
- Title: Exponential weight averaging as damped harmonic motion
- Authors: Jonathan Patsenker, Henry Li, Yuval Kluger
- Abstract summary: The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of quantities in deep learning optimization.
In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring.
We then leverage this physical analogy to analyze the effectiveness of EMA, and propose an improved training algorithm, which we call BELAY.
- Score: 13.305570580429489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponential moving average (EMA) is a commonly used statistic for
providing stable estimates of stochastic quantities in deep learning
optimization. Recently, EMA has seen considerable use in generative models,
where it is computed with respect to the model weights, and significantly
improves the stability of the inference model during and after training. While
the practice of weight averaging at the end of training is well-studied and
known to improve estimates of local optima, the benefits of EMA over the course
of training are less understood. In this paper, we derive an explicit connection
between EMA and a damped harmonic system between two particles, where one
particle (the EMA weights) is drawn to the other (the model weights) via an
idealized zero-length spring. We then leverage this physical analogy to analyze
the effectiveness of EMA, and propose an improved training algorithm, which we
call BELAY. Finally, we demonstrate theoretically and empirically several
advantages enjoyed by BELAY over standard EMA.
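To make the mechanism concrete, below is a minimal sketch of the EMA weight update as it is commonly applied during generative model training; the helper name `update_ema` and the PyTorch framing are illustrative assumptions, not code from the paper. In the paper's analogy, each update nudges the EMA particle toward the model particle, i.e., one discrete step of damped motion under an idealized zero-length spring.

```python
import copy

import torch


@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module,
               beta: float = 0.999) -> None:
    """One EMA step: ema <- beta * ema + (1 - beta) * model.

    In the paper's physical picture, the EMA weights are a particle pulled
    toward the model weights by an idealized zero-length spring; each call
    is one discrete step of that damped motion.
    """
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - beta)  # in-place: ema_p += (1 - beta) * (p - ema_p)


# Usage: keep a gradient-free copy of the model and nudge it after each step.
model = torch.nn.Linear(10, 10)
ema_model = copy.deepcopy(model).requires_grad_(False)
# ... after every optimizer.step() in the training loop:
update_ema(ema_model, model)
```

With beta close to 1 the spring is weak, so the EMA trajectory changes slowly and smooths out noise in the raw weights; the paper's BELAY algorithm builds on this physical picture, though its specifics are not reproduced here.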
Related papers
- Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression [55.2480439325792]
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models.
In this paper, we establish the risk bound of online SGD with EMA for high-dimensional linear regression.
arXiv Detail & Related papers (2025-02-19T21:55:11Z) - Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits [11.801688624472009]
We present a systematic study of the Exponential Moving Average (EMA) of weights.
We show that EMA solutions differ from last-iterate solutions.
We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
arXiv Detail & Related papers (2024-11-27T19:14:27Z) - Learning Mixtures of Experts with EM [28.48469221248906]
Mixtures of Experts (MoE) are machine learning models that partition the input space, with a separate "expert" model trained on each partition.
We study the efficiency of the Expectation Maximization (EM) algorithm for the training of MoE models.
arXiv Detail & Related papers (2024-11-09T03:44:09Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate unconstrained reward score scaling.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z) - WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions (a minimal weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2024-01-22T18:27:08Z) - Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference.
Online VSMC performs both parameter estimation and particle proposal adaptation efficiently and entirely on-the-fly.
arXiv Detail & Related papers (2023-12-19T21:45:38Z) - How to Scale Your EMA [20.94711634514331]
We provide a scaling rule for optimization in the presence of a model EMA (a sketch of the rule appears after this list).
We show the rule's validity where the model EMA contributes to the optimization of the target model.
For Self-Supervised Learning, we enable training of BYOL up to batch size 24,576 without sacrificing performance.
arXiv Detail & Related papers (2023-07-25T20:33:48Z) - Self-learning locally-optimal hypertuning using maximum entropy, and comparison of machine learning approaches for estimating fatigue life in composite materials [0.0]
We develop an ML nearest-neighbors-like algorithm based on the principle of maximum entropy to predict fatigue damage.
The predictions achieve a good level of accuracy, similar to other ML algorithms.
arXiv Detail & Related papers (2022-10-19T12:20:07Z) - Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z) - Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO.
Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently.
The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z) - Learning ergodic averages in chaotic systems [6.85316573653194]
We propose a machine learning method to predict the time average of a chaotic attractor.
The method is based on the hybrid echo state network (hESN).
arXiv Detail & Related papers (2020-01-09T18:12:39Z)
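For the WARM entry above, here is a minimal sketch of uniform weight averaging across reward models; the function name `average_weights` is an illustrative assumption, and the paper's full recipe (a shared pretrained initialization with diverse fine-tunings) is only summarized in the comments.

```python
import torch


def average_weights(models: list[torch.nn.Module]) -> dict[str, torch.Tensor]:
    """Coordinate-wise average of parameters across reward models.

    Sketch in the spirit of WARM: the models are assumed to share one
    architecture (and, per the paper, a common pretrained initialization),
    so their weights live in a space where averaging is meaningful.
    """
    avg = {k: v.detach().clone().float()
           for k, v in models[0].state_dict().items()}
    for m in models[1:]:
        for k, v in m.state_dict().items():
            avg[k] += v.float()
    return {k: v / len(models) for k, v in avg.items()}
```

The single averaged model is then used as the reward model, which that paper reports is more robust to reward hacking than any individual member.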
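And for the "How to Scale Your EMA" entry, a sketch of the scaling rule as that paper states it: when the batch size is scaled by kappa (with the learning rate scaled accordingly), the EMA momentum beta is exponentiated to beta**kappa, so that one large-batch update stands in for kappa small-batch updates. The function name and example numbers below are illustrative assumptions.

```python
def scale_ema_momentum(beta: float, kappa: float) -> float:
    """EMA scaling rule: at kappa times the batch size, one update stands in
    for kappa original updates, so the momentum is exponentiated."""
    return beta ** kappa


# Example: scaling from batch size 512 to 4096 gives kappa = 8.
print(scale_ema_momentum(0.999, 8))  # ~0.99203
```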