Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits
- URL: http://arxiv.org/abs/2411.18704v1
- Date: Wed, 27 Nov 2024 19:14:27 GMT
- Title: Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits
- Authors: Daniel Morales-Brotons, Thijs Vogels, Hadrien Hendrikx
- Abstract summary: We present a systematic study of the Exponential Moving Average (EMA) of weights.
We show that EMA solutions differ from last-iterate solutions.
We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
- Score: 11.801688624472009
- Abstract: Weight averaging of Stochastic Gradient Descent (SGD) iterates is a popular method for training deep learning models. While it is often used as part of complex training pipelines to improve generalization or serve as a 'teacher' model, weight averaging lacks proper evaluation on its own. In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning. Therefore, we suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
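To make the idea concrete, below is a minimal sketch of maintaining an EMA of the weights alongside an ordinary training loop. The PyTorch-style loop, the toy model, and the decay value of 0.999 are illustrative assumptions, not the paper's exact setup.

```python
import copy
import torch


def update_ema(ema_model, model, decay=0.999):
    """EMA update: ema <- decay * ema + (1 - decay) * current weights."""
    with torch.no_grad():
        for e_p, p in zip(ema_model.parameters(), model.parameters()):
            e_p.mul_(decay).add_(p, alpha=1 - decay)
        for e_b, b in zip(ema_model.buffers(), model.buffers()):
            e_b.copy_(b)  # buffers (e.g., BatchNorm stats) are usually copied as-is


model = torch.nn.Linear(10, 2)                 # stand-in for any network
ema_model = copy.deepcopy(model).eval()        # EMA copy, used only for evaluation
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for step in range(100):                        # toy training loop
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_ema(ema_model, model)               # track the EMA of the SGD iterates

# Evaluate or deploy `ema_model` rather than the last-iterate `model`.
```

The EMA copy is never trained directly; it only tracks the optimizer's iterates and is the model one would evaluate or deploy.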
Related papers
- Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression [55.2480439325792]
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models.
In this paper, we establish the risk bound of online SGD with EMA for high-dimensional linear regression.
arXiv Detail & Related papers (2025-02-19T21:55:11Z) - Switch EMA: A Free Lunch for Better Flatness and Sharpness [58.55452862747021]
This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed Switch EMA (SEMA); a code sketch of this switching step is given after the list below.
From both theoretical and empirical perspectives, we demonstrate that SEMA helps DNNs reach generalization optima that better trade off flatness and sharpness.
arXiv Detail & Related papers (2024-02-14T15:28:42Z) - Exponential weight averaging as damped harmonic motion [13.305570580429489]
The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of quantities in deep learning optimization.
In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring.
We then leverage this physical analogy to analyze the effectiveness of EMA and propose an improved training algorithm, which we call BELAY (the EMA recursion behind this analogy is restated after the list below).
arXiv Detail & Related papers (2023-10-20T23:15:46Z) - How to Scale Your EMA [20.94711634514331]
We provide a scaling rule for optimization in the presence of a model EMA.
We demonstrate the rule's validity in settings where the model EMA contributes to the optimization of the target model.
For Self-Supervised Learning, we enable training of BYOL up to batch size 24,576 without sacrificing performance.
arXiv Detail & Related papers (2023-07-25T20:33:48Z) - Gradient Surgery for One-shot Unlearning on Generative Model [0.989293617504294]
We introduce a simple yet effective approach to remove the influence of a data sample on a deep generative model.
Inspired by works in multi-task learning, we propose to manipulate gradients to regularize the interplay of influence among samples.
arXiv Detail & Related papers (2023-07-10T13:29:23Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are commonly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Meta-Learning with Adaptive Hyperparameters [55.182841228303225]
We focus on a complementary factor of the MAML framework: inner-loop optimization (or fast adaptation).
We propose a new weight update rule that greatly enhances the fast adaptation process.
arXiv Detail & Related papers (2020-10-31T08:05:34Z) - Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate [36.83448475700536]
Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints.
We formulate this as the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
arXiv Detail & Related papers (2020-10-06T17:58:29Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
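As referenced in the Switch EMA entry above, here is a minimal sketch of the switching step. The toy model, decay value, and epoch/step counts are placeholder assumptions; the method's exact details should be taken from the SEMA paper itself.

```python
import copy
import torch


def update_ema(ema_model, model, decay=0.999):
    # ema <- decay * ema + (1 - decay) * model
    with torch.no_grad():
        for e_p, p in zip(ema_model.parameters(), model.parameters()):
            e_p.mul_(decay).add_(p, alpha=1 - decay)


model = torch.nn.Linear(10, 2)                 # placeholder network
ema_model = copy.deepcopy(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(5):                         # toy epochs
    for step in range(50):                     # toy batches per epoch
        x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        update_ema(ema_model, model)           # keep the EMA up to date
    # The "switch": copy the EMA weights back into the trained model,
    # so the next epoch continues training from the averaged point.
    with torch.no_grad():
        for p, e_p in zip(model.parameters(), ema_model.parameters()):
            p.copy_(e_p)
```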
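For the damped-harmonic-motion entry above, the analogy can be read off the standard EMA recursion (a hedged restatement, not the paper's derivation). With decay $\beta$, model weights $\theta_t$, and EMA weights $\theta^{\mathrm{EMA}}_t$:

$$
\theta^{\mathrm{EMA}}_{t} \;=\; \beta\,\theta^{\mathrm{EMA}}_{t-1} + (1-\beta)\,\theta_{t}
\;=\; \theta^{\mathrm{EMA}}_{t-1} + (1-\beta)\left(\theta_{t} - \theta^{\mathrm{EMA}}_{t-1}\right).
$$

Read as a discrete-time step, the EMA weights move toward the model weights by an amount proportional to their separation, like a particle attached to the model weights by an idealized zero-length spring; the paper develops this correspondence into a damped harmonic system.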