The Implicit Regularization of Momentum Gradient Descent with Early
Stopping
- URL: http://arxiv.org/abs/2201.05405v1
- Date: Fri, 14 Jan 2022 11:50:54 GMT
- Title: The Implicit Regularization of Momentum Gradient Descent with Early
Stopping
- Authors: Li Wang (1), Yingcong Zhou (2), Zhiguo Fu (1) ((1) Northeast Normal
University, (2) Beihua University)
- Abstract summary: We characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing it with explicit $\ell_2$-regularization (ridge).
In particular, the relative Bayes risk of momentum gradient flow (MGF) to ridge is between 1 and 1.035 under optimal tuning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The study on the implicit regularization induced by gradient-based
optimization is a longstanding pursuit. In the present paper, we characterize
the implicit regularization of momentum gradient descent (MGD) with early
stopping by comparing it with explicit $\ell_2$-regularization (ridge). In
detail, we study MGD in the continuous-time view, the so-called momentum gradient
flow (MGF), and show that its trajectory is closer to ridge than that of gradient
descent (GD) [Ali et al., 2019] for least squares regression. Moreover, we
prove that, under the calibration $t=\sqrt{2/\lambda}$, where $t$ is the time
parameter in MGF and $\lambda$ is the tuning parameter in ridge regression, the
risk of MGF is no more than 1.54 times that of ridge. In particular, the
relative Bayes risk of MGF to ridge is between 1 and 1.035 under the optimal
tuning. Numerical experiments strongly support our theoretical results.
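The calibration $t=\sqrt{2/\lambda}$ can be illustrated numerically. The sketch below compares a momentum-gradient-flow estimate at time $t$ against the ridge estimate with parameter $\lambda$ on a synthetic least-squares problem; the damping constant in the heavy-ball ODE is an assumption, as the paper's exact normalization of MGF is not given here.

```python
import numpy as np

# Synthetic least-squares problem
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

def grad(beta):
    # Gradient of the least-squares loss (1/2n) * ||y - X beta||^2
    return X.T @ (X @ beta - y) / n

def mgf(t_end, h=1e-3, damping=2.0):
    # Integrate the heavy-ball ODE beta'' + damping * beta' + grad(beta) = 0
    # from rest; the damping constant is an assumed normalization.
    beta, v = np.zeros(d), np.zeros(d)
    for _ in range(int(t_end / h)):
        v += h * (-damping * v - grad(beta))
        beta += h * v
    return beta

def ridge(lam):
    # Closed-form ridge estimate (X'X/n + lam I)^{-1} X'y/n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

lam = 0.5
t = np.sqrt(2.0 / lam)          # calibration t = sqrt(2/lambda) from the abstract
b_mgf, b_ridge = mgf(t), ridge(lam)
print(np.linalg.norm(b_mgf - b_ridge))
```

Under this calibration the two estimates land close together on well-conditioned problems, mirroring the paper's risk comparison in spirit rather than reproducing its bounds.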
Related papers
- Optimal Rates in Continual Linear Regression via Increasing Regularization [36.31850682655034]
We study realizable continual linear regression under random task orderings. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $\Omega(1/k)$. We use two frequently used regularization schemes: explicit isotropic $\ell_2$ regularization, and implicit regularization via finite step budgets.
arXiv Detail & Related papers (2025-06-06T19:51:14Z) - A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation [12.321507997896218]
We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$.
arXiv Detail & Related papers (2025-05-26T16:12:45Z) - Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation [51.248784084461334]
We prove new convergence rates for a generalized version of Nesterov acceleration under interpolation conditions.
Our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work.
arXiv Detail & Related papers (2024-04-03T00:41:19Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Implicit regularization in Heavy-ball momentum accelerated stochastic
gradient descent [9.191350169972262]
We show that the discrete heavy-ball momentum update (GD+M) follows a continuous trajectory induced by a modified loss.
This implicit regularizer for (GD+M) is stronger than that of (GD) by a factor of $\left(\frac{1+\beta}{1-\beta}\right)$.
We extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the SGD+M update in a pointwise sense.
arXiv Detail & Related papers (2023-02-02T03:25:38Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Fast Margin Maximization via Dual Acceleration [52.62944011696364]
We present and analyze a momentum-based method for training linear classifiers with an exponentially-tailed loss.
This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual.
arXiv Detail & Related papers (2021-07-01T16:36:39Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Escaping Saddle Points Faster with Stochastic Momentum [9.485782209646445]
In deep networks, momentum appears to significantly improve convergence time.
We show that momentum improves deep network training because it modifies SGD to escape saddle points faster.
We also show how to choose the ideal momentum parameter.
arXiv Detail & Related papers (2021-06-05T23:34:02Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - The Implicit Regularization of Stochastic Gradient Flow for Least
Squares [24.976079444818552]
We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression.
We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow.
We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$.
arXiv Detail & Related papers (2020-03-17T16:37:25Z)
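The gradient-flow-versus-ridge calibration $\lambda = 1/t$ from [Ali et al., 2019], cited in the abstract above, can be checked directly for least squares: gradient flow started at zero has the closed form $\beta_{gf}(t) = A^{-1}(I - e^{-tA})b$ with $A = X^\top X/n$ and $b = X^\top y/n$. The sketch below compares its per-eigendirection shrinkage with ridge's; it is illustrative only, not the papers' risk analysis.

```python
import numpy as np

# Synthetic least-squares problem
rng = np.random.default_rng(1)
n, d = 300, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

A = X.T @ X / n
b = X.T @ y / n
w, V = np.linalg.eigh(A)          # eigendecomposition of the empirical covariance

def beta_gf(t):
    # Gradient flow from zero: shrinkage (1 - e^{-t w}) / w in each eigendirection
    return V @ (((1.0 - np.exp(-t * w)) / w) * (V.T @ b))

def beta_ridge(lam):
    # Ridge: shrinkage 1 / (w + lam) in the same eigenbasis
    return V @ ((V.T @ b) / (w + lam))

t = 10.0
gap = np.linalg.norm(beta_gf(t) - beta_ridge(1.0 / t))
# In each eigendirection, gradient flow shrinks no more than ridge at
# lambda = 1/t: (1 - e^{-tw})/w >= 1/(w + 1/t) for every eigenvalue w > 0.
```

The componentwise inequality in the final comment follows from $(1+s)(1-e^{-s}) \ge s$ with $s = tw$, so at this calibration gradient flow is always at least as aggressive a fit as ridge.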
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.