Early-Warning Signals of Grokking via Loss-Landscape Geometry
- URL: http://arxiv.org/abs/2602.16967v1
- Date: Thu, 19 Feb 2026 00:14:36 GMT
- Title: Early-Warning Signals of Grokking via Loss-Landscape Geometry
- Authors: Yongzhong Xu
- Abstract summary: We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect rises well before generalization. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking -- the abrupt transition from memorization to generalization after prolonged training -- has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect -- a curvature measure derived from non-commuting gradient updates -- rises well before generalization, with lead times following a superlinear power law (alpha approximately 1.18 for SCAN, approximately 1.13 for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity accelerates grokking (roughly 32% on SCAN, roughly 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it. The three task families form a spectrum of causal sensitivity -- modular arithmetic is rigid, Dyck is responsive, SCAN is intermediate -- yet suppression delays or prevents grokking in all cases, establishing necessity as a universal finding. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.
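The abstract defines the commutator defect only as "a curvature measure derived from non-commuting gradient updates," so the following is a minimal sketch of one plausible reading, not the authors' implementation: compare the parameters reached by taking an SGD step on batch A then batch B with those reached by taking the same steps in the opposite order. The toy model, the loss, and the 1/lr^2 normalization are illustrative assumptions.

```python
# Hedged sketch (assumption): "commutator defect" read as the order sensitivity
# of two minibatch SGD steps. The paper's exact estimator is not given in the
# abstract; this is one plausible instantiation.
import copy
import torch
import torch.nn as nn

def flat_params(model):
    """Concatenate all parameters into one detached 1-D tensor."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def sgd_step(model, loss_fn, batch, lr):
    """Return a copy of `model` advanced by one plain SGD step on `batch`."""
    stepped = copy.deepcopy(model)
    x, y = batch
    loss = loss_fn(stepped(x), y)
    grads = torch.autograd.grad(loss, list(stepped.parameters()))
    with torch.no_grad():
        for p, g in zip(stepped.parameters(), grads):
            p -= lr * g
    return stepped

def commutator_defect(model, loss_fn, batch_a, batch_b, lr):
    """||theta_AB - theta_BA|| / lr^2 for SGD steps on batches A and B.

    To leading order the difference is O(lr^2) and is controlled by how the
    local curvature differs between the two batches, which is why it can be
    read as a curvature probe (again, an interpretive assumption).
    """
    theta_ab = flat_params(sgd_step(sgd_step(model, loss_fn, batch_a, lr),
                                    loss_fn, batch_b, lr))
    theta_ba = flat_params(sgd_step(sgd_step(model, loss_fn, batch_b, lr),
                                    loss_fn, batch_a, lr))
    return (theta_ab - theta_ba).norm().item() / lr ** 2

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
    loss_fn = nn.MSELoss()
    batch_a = (torch.randn(32, 8), torch.randn(32, 1))
    batch_b = (torch.randn(32, 8), torch.randn(32, 1))
    print("commutator defect:",
          commutator_defect(model, loss_fn, batch_a, batch_b, lr=0.05))
```

Logged at regular checkpoints during training, a quantity of this kind is what the abstract describes rising well before validation accuracy does.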
Related papers
- The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology [0.0]
We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Z_p). We identify two independent structural factors in standard Transformers: representational magnitude and data-dependent attention routing.
arXiv Detail & Related papers (2026-03-05T14:41:01Z)
- Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking [0.0]
Grokking -- the delayed transition from memorization to generalization in small tasks -- remains poorly understood. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace. We find that curvature grows sharply in directions transverse to the execution subspace while the trajectory remains largely confined to it. (A hedged sketch of this kind of weight-trajectory PCA appears after this list.)
arXiv Detail & Related papers (2026-02-18T03:57:56Z)
- Generalization Bounds for Transformer Channel Decoders [61.55280736553095]
This paper studies the generalization performance of ECCT from a learning-theoretic perspective. To the best of our knowledge, this work provides the first theoretical generalization guarantees for this class of decoders.
arXiv Detail & Related papers (2026-01-11T15:56:37Z)
- Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective [13.434698786044107]
Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction. Its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames. We establish a rigorous mathematical explanation of the collapse by viewing the global attention as a degenerate diffusion process.
arXiv Detail & Related papers (2025-12-25T14:34:27Z)
- Improved Convergence in Parameter-Agnostic Error Feedback through Momentum [49.163769734936295]
We study normalized error feedback algorithms that combine EF with normalized updates, various momentum variants, and parameter-agnostic, time-varying stepsizes. Our results hold with decreasing stepsizes and small mini-batches.
arXiv Detail & Related papers (2025-11-18T13:47:08Z)
- The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold [5.076419064097734]
We argue that post-memorization learning can be understood through the lens of constrained optimization. We show that gradient descent effectively minimizes the weight norm on the zero-loss manifold. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
arXiv Detail & Related papers (2025-11-02T18:44:42Z)
- Mechanistic Insights into Grokking from the Embedding Layer [15.676058752772287]
Grokking, a delayed generalization in neural networks, has been observed in Transformers, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them induces delayed generalization in modular arithmetic tasks. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
arXiv Detail & Related papers (2025-05-21T15:12:34Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Towards Understanding the Generalizability of Delayed Stochastic Gradient Descent [63.43247232708004]
Stochastic gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models. Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization. Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z)
- Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning [47.904127007515925]
We study a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction.
We prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic approximation guarantees as their counterparts.
Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling.
arXiv Detail & Related papers (2023-01-03T04:09:38Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
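The main abstract above notes that "Weight-space PCA reveals that spectral concentration is not a universal precursor," and the "Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking" entry uses PCA of weight trajectories in the same spirit. As a hedged illustration only (the snapshot cadence, which weights to flatten, and the top-k cutoff are assumptions, not details from either paper), spectral concentration can be measured as the fraction of trajectory variance captured by the leading principal components of flattened weight checkpoints:

```python
# Hedged sketch: spectral concentration of a weight-space trajectory via PCA.
# Assumption: concentration = fraction of variance of the checkpoint cloud
# explained by its top-k principal components.
import numpy as np

def spectral_concentration(weight_snapshots, k=2):
    """weight_snapshots: (num_checkpoints, num_params) array of flattened weights.

    Returns the fraction of trajectory variance in the top-k principal
    components; values near 1.0 mean the trajectory is effectively confined
    to a k-dimensional subspace.
    """
    X = np.asarray(weight_snapshots, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)      # center the trajectory
    s = np.linalg.svd(X, compute_uv=False)     # singular values = PCA spectrum
    var = s ** 2
    return float(var[:k].sum() / var.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy trajectory: a 1-D drift through a 100-D parameter space plus noise.
    t = np.linspace(0.0, 1.0, 50)[:, None]
    direction = rng.normal(size=(1, 100))
    snapshots = t * direction + 0.01 * rng.normal(size=(50, 100))
    print("top-2 spectral concentration:", spectral_concentration(snapshots, k=2))
```

Tracking this number over training, alongside the commutator defect sketched earlier, is one way to check on a new task which of the two candidate signals actually leads generalization.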