Exponential weight averaging as damped harmonic motion
- URL: http://arxiv.org/abs/2310.13854v1
- Date: Fri, 20 Oct 2023 23:15:46 GMT
- Title: Exponential weight averaging as damped harmonic motion
- Authors: Jonathan Patsenker, Henry Li, Yuval Kluger
- Abstract summary: The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of quantities in deep learning optimization.
In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring.
We then leverage this physical analogy to analyze the effectiveness of EMA, and propose an improved training algorithm, which we call BELAY.
- Score: 13.305570580429489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponential moving average (EMA) is a commonly used statistic for
providing stable estimates of stochastic quantities in deep learning
optimization. Recently, EMA has seen considerable use in generative models,
where it is computed with respect to the model weights, and significantly
improves the stability of the inference model during and after training. While
the practice of weight averaging at the end of training is well-studied and
known to improve estimates of local optima, the benefits of EMA over the course
of training are less understood. In this paper, we derive an explicit connection
between EMA and a damped harmonic system between two particles, where one
particle (the EMA weights) is drawn to the other (the model weights) via an
idealized zero-length spring. We then leverage this physical analogy to analyze
the effectiveness of EMA, and propose an improved training algorithm, which we
call BELAY. Finally, we demonstrate theoretically and empirically several
advantages enjoyed by BELAY over standard EMA.
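To make the mechanism concrete, below is a minimal sketch of the EMA weight update as it is commonly applied during generative model training; the helper name `update_ema` and the PyTorch framing are illustrative assumptions, not code from the paper. In the paper's analogy, each update nudges the EMA particle toward the model particle, i.e., one discrete step of damped motion under an idealized zero-length spring.

```python
import copy

import torch


@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module,
               beta: float = 0.999) -> None:
    """One EMA step: ema <- beta * ema + (1 - beta) * model.

    In the paper's physical picture, the EMA weights are a particle pulled
    toward the model weights by an idealized zero-length spring; each call
    is one discrete step of that damped motion.
    """
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - beta)  # in-place: ema_p += (1 - beta) * (p - ema_p)


# Usage: keep a gradient-free copy of the model and nudge it after each step.
model = torch.nn.Linear(10, 10)
ema_model = copy.deepcopy(model).requires_grad_(False)
# ... after every optimizer.step() in the training loop:
update_ema(ema_model, model)
```

With beta close to 1 the spring is weak, so the EMA trajectory changes slowly and smooths out noise in the raw weights; the paper's BELAY algorithm builds on this physical picture, though its specifics are not reproduced here.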
Related papers
- Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression [55.2480439325792]
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models.
In this paper, we establish the risk bound of online SGD with EMA for high-dimensional linear regression.
arXiv Detail & Related papers (2025-02-19T21:55:11Z) - Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits [11.801688624472009]
We present a systematic study of the Exponential Moving Average (EMA) of weights.
We show that EMA solutions differ from last-iterate solutions.
We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
arXiv Detail & Related papers (2024-11-27T19:14:27Z) - Learning Mixtures of Experts with EM [28.48469221248906]
Mixtures of Experts (MoE) are machine learning models that partition the input space, with a separate "expert" model trained on each partition.
We study the efficiency of the Expectation Maximization (EM) algorithm for the training of MoE models.
arXiv Detail & Related papers (2024-11-09T03:44:09Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate unconstrained reward score scaling.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z) - WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions (a minimal weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2024-01-22T18:27:08Z) - Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference.
Online VSMC performs both parameter estimation and particle proposal adaptation efficiently and entirely on-the-fly.
arXiv Detail & Related papers (2023-12-19T21:45:38Z) - How to Scale Your EMA [20.94711634514331]
We provide a scaling rule for optimization in the presence of a model EMA (a sketch of the rule appears after this list).
We show the rule's validity where the model EMA contributes to the optimization of the target model.
For Self-Supervised Learning, we enable training of BYOL up to batch size 24,576 without sacrificing performance.
arXiv Detail & Related papers (2023-07-25T20:33:48Z) - Self-learning locally-optimal hypertuning using maximum entropy, and comparison of machine learning approaches for estimating fatigue life in composite materials [0.0]
We develop an ML nearest-neighbors-like algorithm based on the principle of maximum entropy to predict fatigue damage.
The predictions achieve a good level of accuracy, similar to other ML algorithms.
arXiv Detail & Related papers (2022-10-19T12:20:07Z) - Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z) - Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO.
Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently.
The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z) - Learning ergodic averages in chaotic systems [6.85316573653194]
We propose a machine learning method to predict the time average of a chaotic attractor.
The method is based on the hybrid echo state network (hESN).
arXiv Detail & Related papers (2020-01-09T18:12:39Z)
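For the WARM entry above, here is a minimal sketch of uniform weight averaging across reward models; the function name `average_weights` is an illustrative assumption, and the paper's full recipe (a shared pretrained initialization with diverse fine-tunings) is only summarized in the comments.

```python
import torch


def average_weights(models: list[torch.nn.Module]) -> dict[str, torch.Tensor]:
    """Coordinate-wise average of parameters across reward models.

    Sketch in the spirit of WARM: the models are assumed to share one
    architecture (and, per the paper, a common pretrained initialization),
    so their weights live in a space where averaging is meaningful.
    """
    avg = {k: v.detach().clone().float()
           for k, v in models[0].state_dict().items()}
    for m in models[1:]:
        for k, v in m.state_dict().items():
            avg[k] += v.float()
    return {k: v / len(models) for k, v in avg.items()}
```

The single averaged model is then used as the reward model, which that paper reports is more robust to reward hacking than any individual member.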
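And for the "How to Scale Your EMA" entry, a sketch of the scaling rule as that paper states it: when the batch size is scaled by kappa (with the learning rate scaled accordingly), the EMA momentum beta is exponentiated to beta**kappa, so that one large-batch update stands in for kappa small-batch updates. The function name and example numbers below are illustrative assumptions.

```python
def scale_ema_momentum(beta: float, kappa: float) -> float:
    """EMA scaling rule: at kappa times the batch size, one update stands in
    for kappa original updates, so the momentum is exponentiated."""
    return beta ** kappa


# Example: scaling from batch size 512 to 4096 gives kappa = 8.
print(scale_ema_momentum(0.999, 8))  # ~0.99203
```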