Related papers: On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

URL: http://arxiv.org/abs/2602.16849v1
Date: Wed, 18 Feb 2026 20:25:13 GMT
Title: On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking
Authors: Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang
Abstract summary: We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics.
Score: 49.1352577985191
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.
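
To make the role of frequency diversification concrete, the following Python sketch scores each candidate answer c by summing cos(2*pi*k*(a + b - c)/p) over a set of frequencies k. This is an idealized illustration under stated assumptions, not the paper's construction: each "neuron" is a bare cosine with perfectly aligned phase, there is no nonlinearity or noise, and the modulus P = 23 and all function names are hypothetical choices made for the example. With every frequency k = 1, ..., (p - 1)/2 present, the sum equals (p - 1)/2 at c = (a + b) mod p and -1/2 everywhere else, so the scores act like an indicator of the correct sum.

```python
import math


def fourier_logits(a, b, p, freqs):
    """Score each candidate c by summing cos(2*pi*k*(a + b - c)/p) over k in freqs."""
    return [
        sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in freqs)
        for c in range(p)
    ]


def predict(a, b, p, freqs):
    """Return the candidate c with the largest score."""
    logits = fourier_logits(a, b, p, freqs)
    return max(range(p), key=lambda c: logits[c])


def margin(a, b, p, freqs):
    """Gap between the largest and second-largest score."""
    top, runner_up = sorted(fourier_logits(a, b, p, freqs), reverse=True)[:2]
    return top - runner_up


P = 23  # hypothetical small odd prime modulus, chosen only for illustration
ALL_FREQS = list(range(1, (P - 1) // 2 + 1))  # one representative per +/- frequency pair

if __name__ == "__main__":
    correct = all(
        predict(a, b, P, ALL_FREQS) == (a + b) % P
        for a in range(P)
        for b in range(P)
    )
    print("all pairs correct with every frequency:", correct)
    print("margin with frequency 1 only:", margin(3, 5, P, [1]))
    print("margin with every frequency:", margin(3, 5, P, ALL_FREQS))
```

In this toy setting a single frequency already peaks at the right answer, but only by a small gap over its neighbours; combining all frequencies widens the gap to p/2, which is the sense in which diverse frequencies make the collective decision robust.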

Related papers

Why Neural Network Can Discover Symbolic Structures with Gradient-based Training: An Algebraic and Geometric Foundation for Neurosymbolic Reasoning [73.18052192964349]
We develop a theoretical framework that explains how discrete symbolic structures can emerge naturally from continuous neural network training dynamics. By lifting neural parameters to a measure space and modeling training as Wasserstein gradient flow, we show that under geometric constraints, the parameter measure $\mu_t$ undergoes two concurrent phenomena.
arXiv Detail & Related papers (2025-06-26T22:40:30Z)
Similarity Matching Networks: Hebbian Learning and Convergence Over Multiple Time Scales [5.093257685701887]
We consider and analyze the similarity matching network for principal subspace projection. By leveraging a multilevel optimization framework, we prove convergence of the dynamics in the offline setting.
arXiv Detail & Related papers (2025-06-06T14:46:22Z)
Uncovering Magnetic Phases with Synthetic Data and Physics-Informed Training [0.0]
We investigate the efficient learning of magnetic phases using artificial neural networks trained on synthetic data. We incorporate two key forms of physics-informed guidance to enhance model performance. Our results show that synthetic, structured, and computationally efficient training schemes can reveal physically meaningful phase boundaries.
arXiv Detail & Related papers (2025-05-15T15:16:16Z)
A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
arXiv Detail & Related papers (2024-10-24T17:24:34Z)
In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent. For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
Localization, fractality, and ergodicity in a monitored qubit [0.5892638927736115]
We study the statistical properties of a single two-level system (qubit) subject to repetitive ancilla-based measurements. This setup is a fundamental minimal model for exploring the interplay between the unitary dynamics of the system and the nonunitarity introduced by quantum measurements.
arXiv Detail & Related papers (2023-10-03T12:10:30Z)
Onset of scrambling as a dynamical transition in tunable-range quantum circuits [0.0]
We identify a dynamical transition marking the onset of scrambling in quantum circuits with different levels of long-range connectivity. We show that as a function of the interaction range for circuits of different structures, the tripartite mutual information exhibits a scaling collapse. In addition to systems with conventional power-law interactions, we identify the same phenomenon in deterministic, sparse circuits.
arXiv Detail & Related papers (2023-04-19T17:37:10Z)
Third quantization of open quantum systems: new dissipative symmetries and connections to phase-space and Keldysh field theory formulations [77.34726150561087]
We reformulate the technique of third quantization in a way that explicitly connects all three methods. We first show that our formulation reveals a fundamental dissipative symmetry present in all quadratic bosonic or fermionic Lindbladians. For bosons, we then show that the Wigner function and the characteristic function can be thought of as "wavefunctions" of the density matrix.
arXiv Detail & Related papers (2023-02-27T18:56:40Z)
A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer Neural Networks [49.870593940818715]
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed. Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors.
arXiv Detail & Related papers (2022-10-28T17:26:27Z)
Learning the ground state of a non-stoquastic quantum Hamiltonian in a rugged neural network landscape [0.0]
We investigate a class of universal variational wave-functions based on artificial neural networks. In particular, we show that in the present setup the neural network expressivity and Monte Carlo sampling are not primary limiting factors.
arXiv Detail & Related papers (2020-11-23T05:25:47Z)
Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.