Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
- URL: http://arxiv.org/abs/2602.16746v1
- Date: Wed, 18 Feb 2026 03:57:56 GMT
- Title: Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
- Authors: Yongzhong Xu,
- Abstract summary: Grokking -- the delayed transition from memorization to generalization in small tasks -- remains poorly understood. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation. All findings replicate across this learning-rate range, a qualitatively different slow regime (lr=5e-5, wd=0.1, 3 layers), and three random seeds, though alignment dynamics differ quantitatively between regimes. Causal intervention experiments establish that orthogonal gradient flow is necessary but not sufficient for grokking: suppressing it prevents generalization with a monotonic dose-response across four operations, while artificially boosting curvature defects has no effect.
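The abstract describes two concrete measurements: PCA of flattened attention-weight checkpoints to extract a low-dimensional execution subspace, and a commutator defect that quantifies how much two successive gradient steps fail to commute, split into components inside and orthogonal to that subspace. Below is a minimal NumPy sketch of how such quantities could be computed; the function names, the checkpoint format, and the use of two mini-batch gradient functions as the "successive steps" are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): PCA of a weight trajectory
# and a commutator-defect probe, split into in-subspace and transverse parts.
import numpy as np

def execution_subspace(checkpoints, k=1):
    """PCA of flattened weight snapshots; returns the top-k principal
    directions and the fraction of trajectory variance they explain."""
    W = np.stack([c.ravel() for c in checkpoints])        # (T, d)
    W -= W.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(W, full_matrices=False)
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return Vt[:k], explained                               # (k, d), scalar

def commutator_defect(theta, grad_a, grad_b, lr):
    """Non-commutativity of two successive gradient steps: apply step A then
    step B starting from theta, versus B then A, and return the difference."""
    ab = theta - lr * grad_a(theta)
    ab = ab - lr * grad_b(ab)
    ba = theta - lr * grad_b(theta)
    ba = ba - lr * grad_a(ba)
    return ab - ba

def split_defect(defect, basis):
    """Project a defect vector onto the execution subspace (rows of `basis`
    are orthonormal) and its orthogonal complement; return the two norms."""
    in_plane = basis.T @ (basis @ defect)
    transverse = defect - in_plane
    return np.linalg.norm(in_plane), np.linalg.norm(transverse)
```

With a list of flattened attention-weight checkpoints and two mini-batch gradient functions, `explained` corresponds to the PC1 variance share (68-83% in the abstract), while the two norms from `split_defect` separate in-subspace from transverse curvature signal.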
Related papers
- The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology [0.0]
We study grokking -- delayed generalization in Transformers trained on cyclic modular addition (Z_p). We identify two independent structural factors in standard Transformers: representational magnitude and data-dependent attention routing.
arXiv Detail & Related papers (2026-03-05T14:41:01Z)
- Exponential Convergence of (Stochastic) Gradient Descent for Separable Logistic Regression [14.718691362208622]
We show that gradient descent with a simple, non-adaptive increasing step-size schedule achieves exponential convergence for separable logistic regression under a margin condition. We also establish exponential convergence of gradient descent using a lightweight adaptive step-size rule that avoids line search and specialized procedures. A minimal sketch of an increasing step-size schedule of this kind appears after this list.
arXiv Detail & Related papers (2026-02-21T19:31:07Z)
- The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure [0.0]
We study the abrupt transition from memorization to generalization long after reaching near-zero training loss. We extend the geometric analysis to multi-task modular arithmetic. Results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space.
arXiv Detail & Related papers (2026-02-19T22:39:55Z)
- Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks [0.0]
We investigate the structure of learning dynamics in transformer models through carefully controlled arithmetic tasks. Our results suggest a unifying geometric framework for understanding transformer learning.
arXiv Detail & Related papers (2026-02-11T03:57:46Z)
- Spectral Gradient Descent Mitigates Anisotropy-Driven Misalignment: A Case Study in Phase Retrieval [13.218607858857295]
Spectral gradient methods modify gradient updates by preserving directional information while discarding scale. We investigate the mechanisms underlying the gains of these methods through a dynamical analysis of a nonlinear phase retrieval model. A generic sketch of a scale-discarding update appears after this list.
arXiv Detail & Related papers (2026-01-30T07:12:58Z)
- Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations [57.179679246370114]
We identify the distribution of random perturbations that minimizes the estimator's variance as the perturbation stepsize tends to zero. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length. A sketch of the underlying two-point estimator appears after this list.
arXiv Detail & Related papers (2025-10-22T19:06:39Z)
- Description of the Training Process of Neural Networks via Ergodic Theorem: Ghost nodes [3.637162892228131]
We present a unified framework for understanding and accelerating deep neural networks trained via stochastic gradient descent (SGD). We introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which distinguishes genuine convergence toward stable solutions. We propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes so the model gains extra descent directions.
arXiv Detail & Related papers (2025-07-01T17:54:35Z)
- Provably Accelerating Ill-Conditioned Low-rank Estimation via Scaled Gradient Descent, Even with Overparameterization [48.65416821017865]
This chapter introduces a new algorithmic approach, dubbed scaled gradient descent (ScaledGD). It converges linearly at a constant rate independent of the condition number of the low-rank object. It maintains the low per-iteration cost of gradient descent for a variety of tasks. A sketch of the preconditioned ScaledGD update appears after this list.
arXiv Detail & Related papers (2023-10-09T21:16:57Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are empirically central in preventing overfitting.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Deriving Differential Target Propagation from Iterating Approximate Inverses [91.3755431537592]
We show that a particular form of target propagation, which is differential and relies on learned inverses of each layer, gives rise to an update rule that corresponds to an approximate Gauss-Newton gradient-based optimization.
We consider several iterative calculations based on local auto-encoders at each layer in order to achieve more precise inversions for more accurate target propagation.
arXiv Detail & Related papers (2020-07-29T22:34:45Z)
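For the entry on exponential convergence of gradient descent for separable logistic regression, the mechanism named in the summary is a simple, non-adaptive increasing step-size schedule. The sketch below is a hedged illustration of that idea on a plain logistic loss; the geometric growth factor, initial step size, and iteration count are assumptions, not the paper's actual schedule or analysis.

```python
# Sketch only: gradient descent on the logistic loss for linearly separable
# data, with a non-adaptive, monotonically increasing step-size schedule.
# The geometric growth factor and other constants are illustrative.
import numpy as np

def logistic_loss_grad(w, X, y):
    """Mean logistic loss and gradient; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))
    sig = 1.0 / (1.0 + np.exp(np.clip(margins, -30, 30)))  # sigmoid(-margins)
    grad = -(X.T @ (y * sig)) / len(y)
    return loss, grad

def gd_increasing_stepsize(X, y, eta0=0.5, growth=1.05, steps=200):
    w = np.zeros(X.shape[1])
    eta = eta0
    for _ in range(steps):
        loss, grad = logistic_loss_grad(w, X, y)
        w -= eta * grad
        eta *= growth      # step size grows each iteration, no line search
    return w, loss
```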
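For the spectral gradient descent entry, "preserving directional information while discarding scale" is commonly realized by replacing a matrix gradient's singular values with ones, which reduces to plain gradient normalization in the vector case. The sketch below shows that generic construction under those assumptions; it is not necessarily the exact variant analyzed in the paper.

```python
# Sketch: scale-discarding gradient updates. For a matrix-valued gradient,
# keep its singular vectors and set all singular values to 1; for a vector
# gradient, keep the direction and normalize away the magnitude.
import numpy as np

def spectral_step(W, grad, lr):
    """One update along the orthogonal 'direction part' of a matrix gradient."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)

def normalized_step(w, grad, lr, eps=1e-12):
    """Vector analogue: step along the unit-length gradient direction."""
    return w - lr * grad / (np.linalg.norm(grad) + eps)
```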
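For the zeroth-order optimization entry, the object under study is the two-point gradient estimator; a standard form with a Gaussian perturbation is sketched below. The perturbation distribution is exactly what that paper optimizes, so the Gaussian default here is only a baseline assumption.

```python
# Sketch: a standard two-point zeroth-order gradient estimator with a random
# perturbation direction u and smoothing radius mu. Gaussian u is a common
# default; the related paper studies which perturbation law minimizes variance.
import numpy as np

def two_point_estimate(f, x, mu=1e-4, rng=None):
    """g_hat = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
```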
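For the ScaledGD entry, the condition-number independence comes from preconditioning the gradient of a low-rank factor. The sketch below uses the commonly cited symmetric-factorization form of the update; the loss, rank, and step size are placeholders, and the overparameterized variant is not shown.

```python
# Sketch of a ScaledGD-style update for a symmetric low-rank factorization
# M ~ X X^T: right-precondition the factor gradient by (X^T X)^{-1}, an r x r
# inverse that is cheap when the rank r is small. Illustrative form only.
import numpy as np

def scaled_gd_step(X, grad_X, lr):
    """X: (n, r) factor; grad_X: gradient of the loss with respect to X."""
    precond = np.linalg.inv(X.T @ X)
    return X - lr * grad_X @ precond
```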