The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
- URL: http://arxiv.org/abs/2602.18523v1
- Date: Thu, 19 Feb 2026 22:39:55 GMT
- Title: The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
- Authors: Yongzhong Xu
- Abstract summary: We study the abrupt transition from memorization to generalization long after near-zero training loss. We extend geometric analysis to multi-task modular arithmetic. Results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.
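To ground the setup, the sketch below shows one way the dual-task and tri-task modular-arithmetic data described in the abstract could be constructed, with a task token prepended for the shared trunk. The modulus p = 97, the input layout, and the task ids are illustrative assumptions, not the paper's stated encoding.

```python
# Sketch: multi-task modular-arithmetic data (mod-add, mod-mul, mod-sq).
# Assumptions: modulus p = 97, input layout [task_token, a, b], task ids 0-2.
import numpy as np

p = 97  # modulus (illustrative; the abstract does not state the value)

def make_task(op, p):
    """Enumerate all (a, b) operand pairs and labels for one modular task."""
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    a, b = a.ravel(), b.ravel()
    if op == "add":
        y = (a + b) % p
    elif op == "mul":
        y = (a * b) % p
    elif op == "sq":  # squaring uses only a; b is kept as padding
        y = (a * a) % p
    return a, b, y

TASKS = {"add": 0, "mul": 1, "sq": 2}  # tri-task; drop "sq" for dual-task

def build_dataset(task_names):
    xs, ys = [], []
    for name in task_names:
        a, b, y = make_task(name, p)
        t = np.full_like(a, TASKS[name])        # task token for shared trunk
        xs.append(np.stack([t, a, b], axis=1))  # sequence: [task, a, b]
        ys.append(y)
    return np.concatenate(xs), np.concatenate(ys)

X, Y = build_dataset(["add", "mul", "sq"])
print(X.shape, Y.shape)  # (3 * p*p, 3) (3 * p*p,)
```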
Related papers
- The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology [0.0]
We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Z_p). We identify two independent structural factors in standard Transformers: representational magnitude and data-dependent attention routing.
arXiv Detail & Related papers (2026-03-05T14:41:01Z)
- Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking [0.0]
Grokking -- the delayed transition from memorization to generalization in small tasks -- remains poorly understood. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace. We find that curvature grows sharply in directions transverse to the execution subspace while the trajectory remains largely confined to it. A toy trajectory-PCA sketch follows below.
arXiv Detail & Related papers (2026-02-18T03:57:56Z)
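A minimal sketch of the trajectory-PCA measurement this abstract describes: stack flattened weight checkpoints, run PCA, and count the components needed to explain most of the variance. The synthetic trajectory and the 95% threshold are assumptions for illustration.

```python
# Sketch: estimate the dimensionality of a training trajectory with PCA.
# The checkpoints are synthetic stand-ins that truly live in a 4-dim subspace.
import numpy as np

rng = np.random.default_rng(0)
n_ckpt, n_params, k_true = 200, 1000, 4

basis = rng.standard_normal((k_true, n_params))           # execution subspace
coords = rng.standard_normal((n_ckpt, k_true)).cumsum(0)  # random-walk coords
traj = coords @ basis + 0.01 * rng.standard_normal((n_ckpt, n_params))

centered = traj - traj.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
var = s**2 / (s**2).sum()
k_eff = int(np.searchsorted(np.cumsum(var), 0.95)) + 1
print(f"components for 95% trajectory variance: {k_eff}")  # ~4
```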
- Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks [0.0]
We investigate the structure of learning dynamics in transformer models through carefully controlled arithmetic tasks. Our results suggest a unifying geometric framework for understanding transformer learning.
arXiv Detail & Related papers (2026-02-11T03:57:46Z)
- Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space. A toy spectral-tail measurement follows below.
arXiv Detail & Related papers (2026-01-16T23:11:02Z)
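One quantity this abstract alludes to, the spectral tail of a layer's representation covariance, can be estimated as in the sketch below. The random hidden states and the tail cutoff are stand-in assumptions, not the paper's protocol.

```python
# Sketch: spectral entropy and tail mass of hidden-state covariance at a layer.
# H is a random stand-in for per-token hidden states.
import numpy as np

rng = np.random.default_rng(1)
tokens, width = 512, 64
H = rng.standard_normal((tokens, width))

eig = np.sort(np.linalg.eigvalsh(np.cov(H, rowvar=False)))[::-1]
pdist = eig / eig.sum()
entropy = -(pdist * np.log(pdist)).sum()  # low entropy <-> collapsed spectrum
tail_mass = pdist[width // 4:].sum()      # mass beyond the top quarter
print(f"spectral entropy={entropy:.3f}, tail mass={tail_mass:.3f}")
```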
- Quantized Visual Geometry Grounded Transformer [67.15451442018258]
This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. We introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing. We also design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics. A toy rotation-and-smoothing sketch follows below.
arXiv Detail & Related papers (2025-09-25T15:17:11Z)
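A toy sketch of the two pre-quantization transforms named above, under assumed shapes: a Hadamard rotation to spread channel outliers, then per-channel smoothing before uniform 8-bit quantization. This shows the generic pattern only, not QuantVGGT's actual pipeline.

```python
# Sketch: Hadamard rotation + per-channel smoothing before uniform quantization.
import numpy as np

def hadamard(n):  # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal rotation

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 64))
W[:, 3] *= 40.0                      # inject a channel outlier

R = hadamard(64)
W_rot = W @ R                        # rotation spreads the outlier
scale = np.abs(W_rot).max(axis=0, keepdims=True)
W_s = W_rot / scale                  # per-channel smoothing into [-1, 1]

q = np.round(W_s * 127) / 127        # uniform 8-bit quantization grid
W_hat = (q * scale) @ R.T            # undo smoothing and rotation
print("max reconstruction error:", np.abs(W - W_hat).max())
```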
- Discrete-Guided Diffusion for Scalable and Safe Multi-Robot Motion Planning [56.240199425429445]
Multi-Robot Motion Planning (MRMP) involves generating trajectories for multiple robots operating in a shared continuous workspace. While discrete multi-agent path finding (MAPF) methods are broadly adopted due to their scalability, their coarse discretization degrades trajectory quality. This paper tackles the limitations of both approaches by combining discrete MAPF solvers with constrained generative diffusion models.
arXiv Detail & Related papers (2025-08-27T17:59:36Z)
- PMGS: Reconstruction of Projectile Motion across Large Spatiotemporal Spans via 3D Gaussian Splatting [9.314869696272297]
This study proposes PMGS, focusing on reconstructing projectile motion via 3D Gaussian Splatting. We introduce an acceleration constraint to bridge Newtonian mechanics and pose estimation, and design a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion states. A toy acceleration-constraint sketch follows below.
arXiv Detail & Related papers (2025-08-04T17:49:37Z)
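A rough sketch of an acceleration constraint of the kind described above: penalize trajectories whose finite-difference acceleration deviates from constant gravity. The timestep, gravity vector, and noise level are assumptions, not PMGS's formulation.

```python
# Sketch: Newtonian acceleration constraint on a projectile trajectory.
import numpy as np

dt = 1 / 30.0
g = np.array([0.0, -9.81, 0.0])                # assumed gravity direction
t = np.arange(0, 1, dt)[:, None]
v0 = np.array([5.0, 10.0, 0.0])
pos = v0 * t + 0.5 * g * t**2                  # ideal projectile path

def accel_loss(traj):
    acc = np.diff(traj, n=2, axis=0) / dt**2   # finite-difference acceleration
    return np.mean(np.sum((acc - g) ** 2, axis=1))

noisy = pos + 1e-3 * np.random.default_rng(6).standard_normal(pos.shape)
print(accel_loss(pos), accel_loss(noisy))      # ~0 for ideal, > 0 with noise
```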
- GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters [61.51810815162003]
We propose an SE(3)-equivariant adapter framework (GeoAda) that enables flexible and parameter-efficient fine-tuning for controlled generative tasks. GeoAda preserves the model's geometric consistency while mitigating overfitting and catastrophic forgetting. We demonstrate the wide applicability of GeoAda across diverse geometric control types, including frame control, global control, subgraph control, and a broad range of application domains.
arXiv Detail & Related papers (2025-07-02T18:44:03Z)
- Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models. A toy loss-barrier measurement follows below.
arXiv Detail & Related papers (2025-06-28T01:46:36Z)
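A minimal sketch of the loss-barrier measurement underlying linear mode connectivity: evaluate loss along the straight line between two weight vectors. The convex toy model below has a zero barrier by construction; it illustrates only the measurement, not the paper's symmetry-matching method.

```python
# Sketch: loss along the linear path between two solutions of a toy model.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((256, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(256)

def loss(w):
    return np.mean((X @ w - y) ** 2)

w_a = np.linalg.lstsq(X, y, rcond=None)[0]     # one "trained" solution
w_b = w_a + 0.5 * rng.standard_normal(10)      # a second, perturbed solution

path = [loss((1 - a) * w_a + a * w_b) for a in np.linspace(0, 1, 21)]
barrier = max(path) - max(path[0], path[-1])   # common barrier definition
print(f"linear-path loss barrier: {barrier:.4f}")
```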
- Low-Rank Tensor Recovery via Variational Schatten-p Quasi-Norm and Jacobian Regularization [49.85875869048434]
We propose a CP-based low-rank tensor function parameterized by neural networks for implicit neural representation. To achieve sparser CP decomposition, we introduce a variational Schatten-p quasi-norm to prune redundant rank-1 components. For smoothness, we propose a regularization term based on the spectral norm of the Jacobian and Hutchinson's trace estimator. A toy Schatten-p penalty follows below.
arXiv Detail & Related papers (2025-06-27T11:23:10Z)
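A small sketch of a Schatten-p quasi-norm penalty (0 < p < 1) of the kind described above, computed directly from singular values of a plain matrix rather than from the paper's variational CP parameterization.

```python
# Sketch: Schatten-p quasi-norm from singular values. Low-rank matrices are
# penalized less, which is what drives pruning of redundant rank-1 components.
import numpy as np

def schatten_p(W, p=0.5):
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** p).sum())

rng = np.random.default_rng(4)
low_rank = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))
full_rank = rng.standard_normal((50, 50))
print(schatten_p(low_rank), schatten_p(full_rank))  # low-rank scores lower
```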
- Weight Spectra Induced Efficient Model Adaptation [54.8615621415845]
Fine-tuning large-scale foundation models incurs prohibitive computational costs. We show that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact. We propose a novel method that leverages learnable rescaling of top singular directions. A toy rescaling sketch follows below.
arXiv Detail & Related papers (2025-05-29T05:03:29Z)
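A minimal sketch of the adaptation idea described above: freeze a weight matrix and learn only gain factors on its top-k singular directions. The parameterization and the finite-difference training loop are assumptions for illustration, not the paper's method.

```python
# Sketch: adapt a frozen matrix by learning gains on its top-k singular
# directions only (k parameters instead of 32*32).
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((32, 32))
U, s, Vt = np.linalg.svd(W)
k = 4

def adapted(g):
    s_new = s.copy()
    s_new[:k] *= g                   # rescale only the top-k singular values
    return U @ np.diag(s_new) @ Vt

target = adapted(np.array([1.5, 0.8, 1.2, 1.0]))  # pretend fine-tune target

def loss(g):
    return np.sum((adapted(g) - target) ** 2)

g, eps, lr = np.ones(k), 1e-4, 1e-4               # simple toy optimizer
for _ in range(300):
    grad = np.array([(loss(g + eps * np.eye(k)[i]) - loss(g)) / eps
                     for i in range(k)])
    g -= lr * grad
print("learned gains:", np.round(g, 2))           # ~ [1.5, 0.8, 1.2, 1.0]
```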