Related papers: A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

URL: http://arxiv.org/abs/2505.20172v1
Date: Mon, 26 May 2025 16:12:45 GMT
Title: A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation
Authors: Etienne Boursier, Scott Pesme, Radu-Alexandru Dragomir
Abstract summary: We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$.
Score: 12.321507997896218
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the \textit{grokking} effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.
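The two-phase behaviour described in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper's experiments: it assumes a two-parameter least-squares loss $F(w) = \frac{1}{2}(a^\top w - y)^2$, an explicit Euler discretisation of $\dot{w} = -\nabla F(w) - \lambda w$, and illustrative values for `a`, `y`, `w0`, `lam` and `step`. For this loss the set of minimisers is a line, so the slow phase reduces to sliding along that line towards its minimum-norm point.

```python
import numpy as np

def grad_F(w, a, y):
    """Gradient of the toy loss F(w) = 0.5 * (a @ w - y)**2."""
    return a * (a @ w - y)

def run(w0, a, y, lam, step, n_steps):
    """Explicit Euler steps for dw/dt = -grad F(w) - lam * w (weight decay lam)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - step * (grad_F(w, a, y) + lam * w)
    return w

# Illustrative values (assumptions, not from the paper).
a = np.array([1.0, 2.0])
y = 1.0
w0 = np.array([2.0, -1.0])   # orthogonal to a, hence far from the min-norm solution
lam, step = 1e-2, 0.1

# Minimum l2-norm interpolator of the single constraint a @ w = y.
w_min_norm = a * y / (a @ a)

w_fast = run(w0, a, y, lam, step, n_steps=100)      # time 10, well before 1/lam
w_slow = run(w0, a, y, lam, step, n_steps=10_000)   # time 1000 = 10/lam

for name, w in [("t = 10  ", w_fast), ("t = 1000", w_slow)]:
    loss = 0.5 * (a @ w - y) ** 2
    print(name,
          "| loss:", loss,
          "| norm:", np.linalg.norm(w),
          "| distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```

In this sketch the training loss is already close to zero at time 10, while the parameter norm has barely moved; only at times of order $1/\lambda$ does the iterate approach the minimum-norm interpolator.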

Related papers

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks [21.176224458126285]
We use a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$. We prove that small values of $\lambda$ help to recover sparse solutions.
arXiv Detail & Related papers (2024-03-08T13:21:07Z)
On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when the fixed step size is too large. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z)
Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS). This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems. In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
The Implicit Regularization of Momentum Gradient Descent with Early Stopping [0.0]
We characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing it with explicit $\ell_2$-regularization (ridge). In particular, the relative Bayes risk of momentum gradient flow (MGF), the continuous-time view of MGD, to ridge is between 1 and 1.035 under the optimal tuning.
arXiv Detail & Related papers (2022-01-14T11:50:54Z)
Fast Margin Maximization via Dual Acceleration [52.62944011696364]
We present and analyze a momentum-based method for training linear classifiers with an exponentially-tailed loss. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual.
arXiv Detail & Related papers (2021-07-01T16:36:39Z)
High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with best-known rates for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
A Dynamical Central Limit Theorem for Shallow Neural Networks [48.66103132697071]
We prove that the fluctuations around the mean limit remain bounded in mean square throughout training. If the mean-field dynamics converges to a measure that interpolates the training data, we prove that the deviation eventually vanishes in the CLT scaling.
arXiv Detail & Related papers (2020-08-21T18:00:50Z)
On regularization of gradient descent, layer imbalance and flat minima [9.08659783613403]
We analyze the training dynamics for deep linear networks using a new metric - imbalance - which defines the flatness of a solution. We demonstrate that different regularization methods, such as weight decay or noise data augmentation, behave in a similar way.
arXiv Detail & Related papers (2020-07-18T00:09:14Z)
The Implicit Regularization of Stochastic Gradient Flow for Least Squares [24.976079444818552]
We study the implicit regularization of mini-batch gradient descent when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$.
arXiv Detail & Related papers (2020-03-17T16:37:25Z)
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that fractional underdamped Langevin dynamics (FULD) has similarities with natural gradient methods and gradient clipping, which sheds light on their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.