The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
- URL: http://arxiv.org/abs/2603.05228v1
- Date: Thu, 05 Mar 2026 14:41:01 GMT
- Title: The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
- Authors: Alper Yıldırım
- Abstract summary: We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp). We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - investigating whether specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task-specific geometric alignment rather than a generic optimization stabilizer, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
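The two interventions described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: function names, tensor shapes, and the temperature value are hypothetical stand-ins for the paper's actual architecture.

```python
import numpy as np

def spherical_residual(x, eps=1e-8):
    # Project residual-stream vectors onto the unit sphere (L2 normalization),
    # removing the magnitude degree of freedom the abstract identifies.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fixed_temperature_logits(h, W_U, tau=0.1):
    # Unembedding with a fixed temperature scale: both the residual vectors
    # and the unembedding columns are unit-normalized, so logit magnitude is
    # set by tau alone and cannot grow over training.
    h = spherical_residual(h)
    W = W_U / np.linalg.norm(W_U, axis=0, keepdims=True)
    return (h @ W) / tau

def uniform_attention(values):
    # Uniform Attention Ablation: override data-dependent query-key routing
    # with a uniform distribution over positions, reducing attention to a
    # CBOW-style average of the value vectors.
    T = values.shape[-2]
    return values.mean(axis=-2, keepdims=True).repeat(T, axis=-2)
```

Note that `uniform_attention` ignores queries and keys entirely, which is what makes the ablation a test of whether adaptive routing is needed for generalization at all.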
Related papers
- The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure [0.0]
We study the abrupt transition from memorization to generalization that occurs long after training loss reaches near zero. We extend geometric analysis to multi-task modular arithmetic. Results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space.
arXiv Detail & Related papers (2026-02-19T22:39:55Z) - Early-Warning Signals of Grokking via Loss-Landscape Geometry [0.0]
We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect rises well before generalization. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.
arXiv Detail & Related papers (2026-02-19T00:14:36Z) - Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking [0.0]
Grokking -- the delayed transition from memorization to generalization in small tasks -- remains poorly understood. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace. We find that curvature grows sharply in directions transverse to the execution subspace, while the trajectory remains largely confined to it.
arXiv Detail & Related papers (2026-02-18T03:57:56Z) - Parallel Complex Diffusion for Scalable Time Series Generation [50.01609741902786]
PaCoDi is a spectral-native architecture that decouples generative modeling in the frequency domain. We show that PaCoDi outperforms existing baselines in both generation quality and inference speed.
arXiv Detail & Related papers (2026-02-10T14:31:53Z) - Generalizing GNNs with Tokenized Mixture of Experts [75.8310720413187]
We show that improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. We propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
arXiv Detail & Related papers (2026-02-09T22:48:30Z) - Riemannian Flow Matching for Disentangled Graph Domain Adaptation [51.98961391065951]
Graph Domain Adaptation (GDA) typically uses adversarial learning to align graph embeddings in Euclidean space. DisRFM is a geometry-aware GDA framework that unifies embedding and flow-based transport.
arXiv Detail & Related papers (2026-01-31T11:05:35Z) - Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference [1.7523718031184992]
We identify a fundamental mechanism for this failure: Premature Mode Collapse. We propose Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC), an adaptive scheduling algorithm that monitors the stability of the inference process.
arXiv Detail & Related papers (2026-01-30T14:47:18Z) - The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss [53.542743390809356]
This paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradox: the more deterministic and structured the time series, the more severe the bias induced by a point-wise loss function. We present a concrete solution that simultaneously achieves both principles via the DFT or DWT.
arXiv Detail & Related papers (2025-12-21T06:08:22Z) - Pay Attention Later: From Vector Space Diffusion to Linearithmic Spectral Phase-Locking [0.0]
Standard Transformers suffer from a "Semantic Alignment Tax". We introduce the Phase-Resonant Intelligent Spectral Model (PRISM). PRISM encodes semantic identity as resonant frequencies in the complex domain (Cd) and replaces quadratic self-attention with linearithmic O(N log N) Gated Harmonic Convolutions.
arXiv Detail & Related papers (2025-12-01T02:46:15Z) - Geometric-Disentanglement Unlearning [106.99160454669902]
Gradient ascent on forget samples often harms retained knowledge. We propose Geometric-Disentanglement Unlearning (GU), which decomposes any candidate forget-gradient update into components tangential and normal to the retain space and executes only the normal component. Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects.
arXiv Detail & Related papers (2025-11-21T09:58:25Z) - Tracing the Representation Geometry of Language Models from Pretraining to Post-training [22.18942718274405]
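The tangential/normal decomposition summarized in the unlearning entry above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's method: the "retain space" is reduced to a single retain-gradient direction, and the function name is hypothetical.

```python
import numpy as np

def geometric_unlearning_step(g_forget, g_retain, eps=1e-12):
    # Decompose the forget gradient into a component tangential to the
    # retain direction and a normal (orthogonal) component; return only
    # the normal component, so the update does not move along directions
    # that would change performance on retained data.
    u = g_retain / (np.linalg.norm(g_retain) + eps)  # unit retain direction
    tangential = np.dot(g_forget, u) * u             # projection onto retain dir
    return g_forget - tangential                     # orthogonal remainder
```

A full implementation would project against a subspace spanned by many retain gradients rather than a single vector, but the orthogonality guarantee is the same.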
We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training. We uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data.
arXiv Detail & Related papers (2025-09-27T00:46:29Z) - Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior [54.629850694790036]
Spectral-Normalized Identity Priors (SNIP) is a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping.
We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance.
arXiv Detail & Related papers (2020-10-05T05:40:56Z)
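The identity-mapping penalty behind SNIP-style pruning can be sketched as follows. This is a minimal NumPy illustration under a strong simplifying assumption: the residual module is summarized by a single square weight matrix, whereas the actual method operates on whole Transformer modules.

```python
import numpy as np

def identity_prior_penalty(W):
    # Measure how far a residual block's linear map is from the identity
    # mapping, using the spectral norm (largest singular value) of W - I.
    # A block whose penalty is driven to ~0 computes approximately
    # f(x) = x and is a candidate for pruning.
    deviation = W - np.eye(W.shape[0])
    return np.linalg.norm(deviation, ord=2)  # spectral norm for 2-D arrays
```

The spectral norm (rather than, say, the Frobenius norm) bounds the worst-case effect of the block on any input direction, which is why it is a natural choice for deciding whether a module can be removed safely.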
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.