Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach
- URL: http://arxiv.org/abs/2603.01192v2
- Date: Tue, 03 Mar 2026 17:17:30 GMT
- Title: Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach
- Authors: Ben Cullen, Sergio Estan-Ruiz, Riya Danait, Jiayi Li,
- Abstract summary: We study grokking, the abrupt transition from memorization to generalisation after extended training. We interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold: we derive closed-form expressions for the LLC in quadratic networks trained on modular arithmetic tasks, with the corresponding empirical verification.
- Score: 3.551701030393209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging this theory, we interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold: we derive closed-form expressions for the LLC in quadratic networks trained on modular arithmetic tasks, together with empirical verification; and we provide empirical evidence that LLC trajectories are a reliable tool for tracking generalisation dynamics and interpreting phase transitions during training.
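For context, the SLT result this abstract leans on is the free-energy asymptotics; a standard statement (our notation, summarized from the SLT literature rather than quoted from the paper) is:

```latex
% Free energy of the posterior localized near a minimum w^* with local
% learning coefficient \lambda and multiplicity m (standard SLT asymptotics):
F_n \;=\; n L_n(w^*) \;+\; \lambda \log n \;-\; (m - 1)\log\log n \;+\; O_p(1)
% Smaller \lambda means larger posterior mass near w^*, and the expected
% Bayes generalization error scales as \mathbb{E}[G_n] \approx \lambda / n.
```

Empirical LLC verification of the kind the abstract mentions is typically done with a tempered, localized SGLD sampler. Below is a minimal sketch assuming the estimator of Lau et al. (2023), lambda_hat = n * beta * (E_w[L_n(w)] - L_n(w*)); the function name and hyperparameters are illustrative, not the paper's setup:

```python
# Minimal sketch of SGLD-based LLC estimation (assumption: the estimator of
# Lau et al., 2023); function name and hyperparameters are illustrative.
import copy
import math
import torch

def estimate_llc(model, loss_fn, data, targets,
                 n_steps=500, lr=1e-4, gamma=100.0, n_chains=4):
    """lambda_hat = n * beta * (E_w[L_n(w)] - L_n(w*)), beta = 1/log(n)."""
    n = len(data)
    beta = 1.0 / math.log(n)
    with torch.no_grad():
        init_loss = loss_fn(model(data), targets).item()
    chain_means = []
    for _ in range(n_chains):
        m = copy.deepcopy(model)
        w_star = [p.detach().clone() for p in m.parameters()]
        losses = []
        for _ in range(n_steps):
            loss = loss_fn(m(data), targets)
            m.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, p0 in zip(m.parameters(), w_star):
                    # Tempered gradient drift, a localization pull toward w*,
                    # and Gaussian injection noise (the Langevin part).
                    drift = n * beta * p.grad + gamma * (p - p0)
                    p.add_(-0.5 * lr * drift + (lr ** 0.5) * torch.randn_like(p))
            losses.append(loss.item())
        # Average the second half of each chain (first half as burn-in).
        chain_means.append(sum(losses[n_steps // 2:]) / (n_steps - n_steps // 2))
    return n * beta * (sum(chain_means) / n_chains - init_loss)
```

Tracking this estimate across checkpoints is what an "LLC trajectory" means here: a drop around the grokking step would mark the move into the more degenerate, generalising basin.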
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
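A minimal sketch of how such a precedence graph could prune candidates, assuming a trace is just an ordered list of theorem IDs; the class name and pruning rule are our illustration, not the paper's implementation:

```python
# Hypothetical sketch of a theorem-precedence graph built from solution
# traces; the paper's actual construction and constraints may differ.
from collections import defaultdict

class TheoremPrecedenceGraph:
    def __init__(self):
        # theorem -> set of theorems ever observed before it in a trace
        self.preds = defaultdict(set)

    def add_trace(self, trace):
        """Record temporal dependencies from one historical solution trace."""
        for i, theorem in enumerate(trace):
            self.preds[theorem].update(trace[:i])

    def prune(self, candidates, applied):
        """Topological filter: keep candidates whose recorded predecessors
        have already been applied in the current partial solution."""
        applied = set(applied)
        return [c for c in candidates if self.preds.get(c, set()) <= applied]

# Usage: graph.add_trace(["t3", "t7", "t12"]) for each historical trace,
# then graph.prune(all_theorems, applied_so_far) at each inference step.
```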
- SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training [54.8494905524997]
Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. We propose SENTINEL, a verification mechanism for pipeline parallelism (PP) training without duplication. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
arXiv Detail & Related papers (2026-03-03T23:51:10Z)
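The abstract does not spell out the mechanism, but a generic stagewise spot-check is one plausible reading of "verification without duplication"; the names and hashing scheme below are our assumptions, not SENTINEL's actual protocol:

```python
# Hypothetical stagewise integrity check for pipeline-parallel training;
# illustration of the setting only, not SENTINEL's protocol.
import hashlib
import torch

def commit(activations: torch.Tensor) -> str:
    """Commit to a stage's output activations with a content hash."""
    return hashlib.sha256(activations.detach().cpu().numpy().tobytes()).hexdigest()

def verify_stage(stage_fn, stage_input, claimed_commit) -> bool:
    """Recompute a single pipeline stage on a trusted node and compare."""
    return commit(stage_fn(stage_input)) == claimed_commit

# Instead of duplicating every stage's work, a verifier can audit one
# randomly chosen stage per step and compare against its commitment.
```

Note that exact hash equality is fragile under floating-point nondeterminism, so a practical scheme would need deterministic kernels or tolerance-aware commitments.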
- Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck. Under outcome-only reward signals, reinforcement-learning fine-tuning achieves high accuracy with polynomial sample complexity. We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z)
- A simple mean field model of feature learning [2.3215806943173676]
We derive a tractable, self-consistent mean-field (MF) theory for two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry-breaking phase transition at which networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of feature learning (FL) in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition.
arXiv Detail & Related papers (2025-10-16T22:28:44Z)
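For concreteness, here is what the training procedure named above looks like on a toy problem: plain SGLD for a two-layer network. The widths, temperature, and single-index target are illustrative assumptions, not the paper's experimental setup:

```python
# Minimal SGLD training of a two-layer network (illustrative setup).
import torch

torch.manual_seed(0)
n, d, width, beta, lr = 256, 16, 128, 1e4, 1e-3
X = torch.randn(n, d)
y = torch.tanh(X[:, 0])                     # toy single-index target
model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.Tanh(), torch.nn.Linear(width, 1))

for step in range(2000):
    loss = ((model(X).squeeze(-1) - y) ** 2).mean()
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            # Gradient step plus Gaussian noise at inverse temperature beta;
            # the injected noise is what makes this Langevin dynamics.
            p.add_(-lr * p.grad + ((2 * lr / beta) ** 0.5) * torch.randn_like(p))
```

The finite-width symmetry-breaking transition described above would show up in such a run as a sudden drop in loss once hidden units align with the target direction.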
- In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning [51.56484100374058]
We introduce a principled risk decomposition that separates the total ICL risk into two components: Bayes Gap and Posterior Variance. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty.
arXiv Detail & Related papers (2025-10-13T03:42:31Z)
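In symbols, the decomposition described above has the following shape (notation ours; the paper's exact definitions may differ):

```latex
% Hedged reconstruction of the risk decomposition, not quoted from the paper:
% the Posterior Variance is the risk of the Bayes-optimal predictor (intrinsic
% task uncertainty), and the Bayes Gap is the model's excess risk over it.
R_{\mathrm{ICL}}(\theta)
  \;=\; \underbrace{R_{\mathrm{ICL}}(\theta) - R_{\mathrm{Bayes}}}_{\text{Bayes Gap (model-dependent)}}
  \;+\; \underbrace{R_{\mathrm{Bayes}}}_{\text{Posterior Variance (irreducible)}}
```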
- Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency [52.52950138164424]
We show that when leveraging off-the-shelf (vision) foundation models for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. We embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes.
arXiv Detail & Related papers (2025-08-19T05:22:59Z)
- Hidden Breakthroughs in Language Model Training [9.183934538035562]
This paper argues that similar breakthroughs occur frequently throughout training but are obscured by a loss metric that collapses all variation into a single scalar. We introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model's capabilities.
arXiv Detail & Related papers (2025-06-18T20:40:16Z)
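A hedged sketch of the decomposition idea: project a first-order loss change onto a basis of the low-rank training subspace. The basis construction (SVD of recent parameter updates) is our assumption; POLCA's actual bases may differ:

```python
# Illustrative decomposition of a loss change along a low-rank basis of the
# training subspace; not POLCA's exact algorithm.
import numpy as np

def decompose_loss_change(param_deltas, grad, k=8):
    """Attribute the first-order loss change of the latest update to the
    top-k directions of the subspace explored by training.

    param_deltas: (T, P) parameter updates across T recent steps
    grad:         (P,)   loss gradient at the current checkpoint
    """
    # Orthonormal basis spanning the low-rank training subspace.
    _, _, vt = np.linalg.svd(param_deltas, full_matrices=False)
    basis = vt[:k]                      # (k, P) right singular vectors
    coeffs = basis @ param_deltas[-1]   # projection of the latest update
    # Per-direction contribution (v_i . delta) * (v_i . grad); summing these
    # recovers grad . delta up to the part of delta outside the basis.
    return coeffs * (basis @ grad)
```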
- Towards Robust Trajectory Representations: Isolating Environmental Confounders with Causal Learning [23.659451444973627]
We present a trajectory modeling framework (TrajCL) based on causal learning.
TrajCL markedly enhances performance in trajectory classification tasks while showcasing superior generalization and interpretability.
arXiv Detail & Related papers (2024-04-22T10:34:58Z)
- Learning in PINNs: Phase transition, total diffusion, and generalization [1.8802875123957965]
We investigate the learning dynamics of fully-connected neural networks through the lens of the gradient signal-to-noise ratio (SNR).
We identify a third phase, termed "total diffusion".
We explore the information-induced compression phenomenon, pinpointing a significant compression of activations at the total diffusion phase.
arXiv Detail & Related papers (2024-03-27T12:10:30Z)
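One common way to operationalize the gradient SNR diagnostic mentioned above is to compare the mean and spread of per-batch gradients; the exact definition used in the paper may differ:

```python
# Hedged sketch: gradient signal-to-noise ratio across mini-batches.
import torch

def gradient_snr(model, loss_fn, batches):
    """SNR = ||mean of per-batch gradients|| / ||per-coordinate std||."""
    grads = []
    for X, y in batches:
        model.zero_grad()
        loss_fn(model(X), y).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    G = torch.stack(grads)              # (num_batches, num_params)
    return (G.mean(dim=0).norm() / G.std(dim=0).norm()).item()

# Tracking this ratio over training is one way to detect the phase
# transitions the abstract describes.
```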
- Relaxed Contrastive Learning for Federated Learning [48.96253206661268]
We propose a novel contrastive learning framework to address the challenges of data heterogeneity in federated learning.
Our framework outperforms all existing federated learning approaches by huge margins on the standard benchmarks.
arXiv Detail & Related papers (2024-01-10T04:55:24Z)
- Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization [87.21285093582446]
Diffusion Generative Flow Samplers (DGFS) is a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments.
Our method takes inspiration from the theory developed for generative flow networks (GFlowNets).
arXiv Detail & Related papers (2023-10-04T09:39:05Z)
- Revisiting Deep Semi-supervised Learning: An Empirical Distribution Alignment Framework and Its Generalization Bound [97.93945601881407]
We propose a new deep semi-supervised learning framework called Semi-supervised Learning by Empirical Distribution Alignment (SLEDA).
We show the generalization error of semi-supervised learning can be effectively bounded by minimizing the training error on labeled data.
Building upon our new framework and the theoretical bound, we develop a simple and effective deep semi-supervised learning method called Augmented Distribution Alignment Network (ADA-Net).
arXiv Detail & Related papers (2022-03-13T11:59:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.