Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression
- URL: http://arxiv.org/abs/2502.14123v1
- Date: Wed, 19 Feb 2025 21:55:11 GMT
- Title: Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression
- Authors: Xuheng Li, Quanquan Gu
- Abstract summary: Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models.
In this paper, we establish the risk bound of online SGD with EMA for high-dimensional linear regression.
- Score: 55.2480439325792
- Abstract: Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
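For intuition only (not code from the paper): below is a minimal numpy sketch of one-pass SGD for linear regression that also tracks an exponential moving average (EMA) of its iterates. The step size, the decay parameter `ema_decay`, the power-law covariance spectrum, and the noise level are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

def sgd_with_ema(X, y, step_size, ema_decay):
    """One-pass (online) SGD for least squares, tracking an EMA of the iterates.

    EMA update: w_ema_t = ema_decay * w_ema_{t-1} + (1 - ema_decay) * w_t.
    Returns both the last SGD iterate and the EMA iterate for comparison.
    """
    n, d = X.shape
    w = np.zeros(d)        # current SGD iterate
    w_ema = np.zeros(d)    # exponential moving average of the iterates
    for t in range(n):
        x_t, y_t = X[t], y[t]
        grad = (x_t @ w - y_t) * x_t                        # gradient of 0.5 * (x_t @ w - y_t)**2
        w = w - step_size * grad                            # SGD step on one fresh sample
        w_ema = ema_decay * w_ema + (1.0 - ema_decay) * w   # EMA of the iterates
    return w, w_ema

# Illustrative overparameterized data: power-law covariance spectrum, d > n.
rng = np.random.default_rng(0)
n_train, n_test, d = 1000, 1000, 2000
eigs = 1.0 / np.arange(1, d + 1) ** 2                       # eigenvalues of the data covariance
X_train = rng.normal(size=(n_train, d)) * np.sqrt(eigs)
X_test = rng.normal(size=(n_test, d)) * np.sqrt(eigs)
w_star = rng.normal(size=d)
y_train = X_train @ w_star + 0.1 * rng.normal(size=n_train)
y_test = X_test @ w_star + 0.1 * rng.normal(size=n_test)

w_last, w_avg = sgd_with_ema(X_train, y_train, step_size=0.3, ema_decay=0.99)
print("test risk, last SGD iterate:", np.mean((X_test @ w_last - y_test) ** 2))
print("test risk, EMA iterate:     ", np.mean((X_test @ w_avg - y_test) ** 2))
```

Running this compares the held-out risk of the last SGD iterate with that of the EMA iterate; the abstract's result is that the EMA iterate's variance error is never larger than that of unaveraged SGD, while its bias error still decays exponentially in every eigen-subspace of the data covariance.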
Related papers
- Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits [11.801688624472009]
We present a systematic study of the Exponential Moving Average (EMA) of weights.
We show that EMA solutions differ from last-iterate solutions.
We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
arXiv Detail & Related papers (2024-11-27T19:14:27Z)
- Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z)
- The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models [37.6314945221565]
Deep networks are known to generalize well to unseen data.
Implicit regularization properties ensure that interpolating solutions with "good" properties are found.
We present simulation results that validate the theory and introduce two data models.
arXiv Detail & Related papers (2023-02-18T22:23:42Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are empirically central to preventing overfitting.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum parameters for which multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.