Implicit regularization in Heavy-ball momentum accelerated stochastic
gradient descent
- URL: http://arxiv.org/abs/2302.00849v1
- Date: Thu, 2 Feb 2023 03:25:38 GMT
- Title: Implicit regularization in Heavy-ball momentum accelerated stochastic
gradient descent
- Authors: Avrajit Ghosh, He Lyu, Xitong Zhang, Rongrong Wang
- Abstract summary: We show that the discrete H.B momentum update (GD+M) follows a continuous trajectory induced by a modified loss.
This implicit regularizer for (GD+M) is stronger than that of (GD) by a factor of $\left(\frac{1+\beta}{1-\beta}\right)$.
We extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the (SGD+M) update in a pointwise sense.
- Score: 9.191350169972262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is well known that the finite step-size ($h$) in Gradient Descent (GD)
implicitly regularizes solutions to flatter minima. A natural question to ask
is "Does the momentum parameter $\beta$ play a role in implicit regularization
in Heavy-ball (H.B) momentum accelerated gradient descent (GD+M)?". To answer
this question, first, we show that the discrete H.B momentum update (GD+M)
follows a continuous trajectory induced by a modified loss, which consists of
an original loss and an implicit regularizer. Then, we show that this implicit
regularizer for (GD+M) is stronger than that of (GD) by a factor of
$\left(\frac{1+\beta}{1-\beta}\right)$, thus explaining why (GD+M) shows better
generalization performance and higher test accuracy than (GD). Furthermore, we
extend our analysis to the stochastic version of gradient descent with momentum
(SGD+M) and characterize the continuous trajectory of the update of (SGD+M) in
a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M)
through a series of experiments validating our theory.
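The heavy-ball recursion discussed in the abstract can be sketched on a toy problem. This is an illustrative example only, not the paper's code: the quadratic loss, step size, and momentum value are hypothetical, and the script merely prints the paper's claimed regularization-strength factor $(1+\beta)/(1-\beta)$ for a sample $\beta$.

```python
# Heavy-ball momentum (GD+M) vs. plain GD on a toy quadratic loss.
# Illustrative sketch; the loss f, step size h, and beta are hypothetical
# choices, not taken from the paper's experiments.

def grad(x):
    # gradient of f(x) = 0.5 * x^2
    return x

def gd(x0, h, steps):
    x = x0
    for _ in range(steps):
        x = x - h * grad(x)
    return x

def gd_momentum(x0, h, beta, steps):
    # heavy-ball update: x_{k+1} = x_k - h * grad(x_k) + beta * (x_k - x_{k-1})
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - h * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

beta = 0.9
# Per the paper's modified-loss analysis, the implicit regularizer of
# (GD+M) is stronger than that of (GD) by the factor (1 + beta) / (1 - beta).
factor = (1 + beta) / (1 - beta)
print(factor)  # 19.0 for beta = 0.9
```

For $\beta = 0.9$ the factor is 19, which is the sense in which even moderate momentum substantially amplifies the implicit bias toward flatter minima.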
Related papers
- Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression [28.3662709740417]
In logistic regression, gradient descent (GD) diverges in norm while converging in direction to the maximum $\ell_2$-margin solution.
This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression.
arXiv Detail & Related papers (2025-02-18T21:04:06Z) - Improving the Convergence Rates of Forward Gradient Descent with Repeated Sampling [5.448070998907116]
Forward gradient descent (FGD) has been proposed as a biologically more plausible alternative to gradient descent.
In this paper we show that by computing $\ell$ FGD steps based on each training sample, this suboptimality factor becomes $d/(\ell \wedge d)$.
We also show that FGD with repeated sampling can adapt to low-dimensional structure in the input distribution.
arXiv Detail & Related papers (2024-11-26T16:28:16Z) - Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation [51.248784084461334]
We prove new convergence rates for a generalized version of Nesterov acceleration under interpolation conditions.
Our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work.
arXiv Detail & Related papers (2024-04-03T00:41:19Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - The Implicit Regularization of Momentum Gradient Descent with Early
Stopping [0.0]
We characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing it with explicit $\ell_2$-regularization (ridge).
In particular, the relative Bayes risk of MGD to ridge is between 1 and 1.035 under optimal tuning.
arXiv Detail & Related papers (2022-01-14T11:50:54Z) - On the Hyperparameters in Stochastic Gradient Descent with Momentum [6.396288020763144]
We present a theoretical analysis of stochastic gradient descent with momentum (SGDM) in this paper.
We show how the optimal linear convergence rate for SGDM relates to a surrogate rate as the momentum parameter increases from zero.
Finally, we show that the surrogate momentum under this rate has no essential difference from the standard momentum.
arXiv Detail & Related papers (2021-08-09T11:25:03Z) - Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that the stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - A New Accelerated Stochastic Gradient Method with Momentum [4.967897656554012]
Stochastic gradient descent with momentum (SGDM) uses weights that decay exponentially with the iteration number to generate the momentum term.
We provide a theoretical convergence analysis for our method, showing that both the exponentially decaying weights and our inverse-proportionally decaying weights limit the variance of the moving direction of the parameters being optimized to a region.
arXiv Detail & Related papers (2020-05-31T03:04:32Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.