Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
- URL: http://arxiv.org/abs/2602.10496v2
- Date: Fri, 13 Feb 2026 16:47:38 GMT
- Title: Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
- Authors: Yongzhong Xu
- Abstract summary: We investigate the structure of learning dynamics in transformer models through carefully controlled arithmetic tasks. Our results suggest a unifying geometric framework for understanding transformer learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the geometric structure of learning dynamics in overparameterized transformer models through carefully controlled modular arithmetic tasks. Our primary finding is that despite operating in high-dimensional parameter spaces ($d=128$), transformer training trajectories rapidly collapse onto low-dimensional execution manifolds of dimension $3$--$4$. This dimensional collapse is robust across random seeds and moderate task difficulties, though the orientation of the manifold in parameter space varies between runs. We demonstrate that this geometric structure underlies several empirically observed phenomena: (1) sharp attention concentration emerges as saturation along routing coordinates within the execution manifold, (2) SGD commutators are preferentially aligned with the execution subspace (up to $10\times$ random baseline) early in training, with $>92\%$ of non-commutativity confined to orthogonal staging directions and this alignment decreasing as training converges, and (3) sparse autoencoders capture auxiliary routing structure but fail to isolate execution itself, which remains distributed across the low-dimensional manifold. Our results suggest a unifying geometric framework for understanding transformer learning, where the vast majority of parameters serve to absorb optimization interference while core computation occurs in a dramatically reduced subspace. These findings have implications for interpretability, training curriculum design, and understanding the role of overparameterization in neural network learning.
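The abstract's central measurements (training trajectories collapsing onto a 3--4 dimensional manifold inside a $d=128$ parameter space, and SGD steps aligning with that subspace) can be illustrated with a standard PCA computation. The sketch below is only an assumption-laden illustration, not the authors' code: the checkpoints are synthetic, and the 95% variance threshold and top-$k$ subspace size are hypothetical choices.

```python
# Minimal sketch (not the authors' implementation): estimating the effective dimension of a
# training trajectory with PCA, and measuring how much of a parameter update lies inside
# the resulting low-dimensional subspace. All data below is synthetic/hypothetical.
import numpy as np

def effective_dimension(trajectory: np.ndarray, var_threshold: float = 0.95) -> int:
    """Number of principal components needed to explain `var_threshold` of the variance
    of a (num_checkpoints, num_params) trajectory of flattened weights."""
    centered = trajectory - trajectory.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)      # PCA spectrum via singular values
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), var_threshold) + 1)

def fraction_in_subspace(update: np.ndarray, trajectory: np.ndarray, k: int) -> float:
    """Fraction of an update vector's squared norm lying in the top-k PCA subspace."""
    centered = trajectory - trajectory.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                      # (k, num_params) orthonormal rows
    proj = basis.T @ (basis @ update)   # projection onto the subspace
    return float(np.dot(proj, proj) / np.dot(update, update))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in: 200 checkpoints of a 128-parameter slice, mostly 3-dimensional.
    latent = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 128))
    traj = latent + 0.01 * rng.normal(size=(200, 128))  # small off-manifold noise
    print("effective dimension:", effective_dimension(traj))
    step = traj[-1] - traj[-2]                          # a late "SGD-like" step
    print("fraction of step in top-3 subspace:", fraction_in_subspace(step, traj, k=3))
```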
Related papers
- Structural Action Transformer for 3D Dexterous Manipulation [80.07649565189035]
Cross-embodiment skill transfer is a challenge for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations. This paper proposes a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective.
arXiv Detail & Related papers (2026-03-04T11:38:12Z) - Transformers converge to invariant algorithmic cores [0.0]
GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number across scales. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.
arXiv Detail & Related papers (2026-02-26T04:09:11Z) - Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking [0.0]
Grokking -- the delayed transition from memorization to generalization in small tasks -- remains poorly understood. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace. We find that curvature grows sharply in directions transverse to the execution subspace while the trajectory remains largely confined to it.
arXiv Detail & Related papers (2026-02-18T03:57:56Z) - Attention Is Not What You Need [0.0]
We argue that standard multi-head attention is best seen as a form of tensor lifting. We propose an attention-free architecture based on Grassmann flows.
arXiv Detail & Related papers (2025-12-22T14:29:18Z) - From Coefficients to Directions: Rethinking Model Merging with Directional Alignment [66.99062575537555]
We introduce a unified geometric framework, Merging with Directional Alignment, which aligns directional structures consistently in both the parameter and feature spaces. Our analysis shows that directional alignment improves structural coherence, and extensive experiments across benchmarks, model scales, and task configurations further validate the effectiveness of our approach.
arXiv Detail & Related papers (2025-11-29T08:40:58Z) - The Neural Differential Manifold: An Architecture with Explicit Geometric Structure [8.201374511929538]
This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling.
arXiv Detail & Related papers (2025-10-29T02:24:27Z) - Understanding Post-Training Structural Changes in Large Language Models [3.054513120350576]
Post-training fundamentally alters the behavior of large language models (LLMs). This work focuses on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation.
arXiv Detail & Related papers (2025-09-22T15:03:36Z) - GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters [61.51810815162003]
We propose an SE(3)-equivariant adapter framework (GeoAda) that enables flexible and parameter-efficient fine-tuning for controlled generative tasks. GeoAda preserves the model's geometric consistency while mitigating overfitting and catastrophic forgetting. We demonstrate the wide applicability of GeoAda across diverse geometric control types, including frame control, global control, subgraph control, and a broad range of application domains.
arXiv Detail & Related papers (2025-07-02T18:44:03Z) - Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models (a minimal loss-barrier computation along such a path is sketched after this list).
arXiv Detail & Related papers (2025-06-28T01:46:36Z) - The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions [51.68215326304272]
We show that even small perturbations reliably cause otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. Our findings provide insights into neural network training stability, with practical implications for fine-tuning, model merging, and diversity of model ensembles.
arXiv Detail & Related papers (2025-06-16T08:35:16Z) - Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models [0.0]
Recent evidence from miniature neural networks suggests that the boundary separating convergent from divergent training outcomes displays fractal characteristics. This study extends these findings to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure. The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales.
arXiv Detail & Related papers (2025-01-08T05:24:11Z) - Maintaining Structural Integrity in Parameter Spaces for Parameter Efficient Fine-tuning [78.39310274926535]
Adapting pre-trained foundation models for various downstream tasks has been prevalent in artificial intelligence, but updating all of their weights is resource-intensive. To mitigate this, several fine-tuning techniques have been developed to update the pre-trained model weights in a more resource-efficient manner. This paper introduces a generalized parameter-efficient fine-tuning framework designed for parameter spaces of various dimensions.
arXiv Detail & Related papers (2024-05-23T16:04:42Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z) - Autoencoders for discovering manifold dimension and coordinates in data from complex dynamical systems [0.0]
An autoencoder framework combines implicit regularization with internal linear layers and $L_2$ regularization (weight decay).
We show that this framework can be naturally extended for applications of state-space modeling and forecasting.
arXiv Detail & Related papers (2023-05-01T21:14:47Z) - Manifold Learning via Manifold Deflation [105.7418091051558]
Dimensionality reduction methods provide a valuable means to visualize and interpret high-dimensional data.
Many popular methods can fail dramatically, even on simple two-dimensional manifolds.
This paper presents an embedding method for a novel, incremental tangent space estimator that incorporates global structure as coordinates.
Empirically, we show our algorithm recovers novel and interesting embeddings on real-world and synthetic datasets.
arXiv Detail & Related papers (2020-07-07T10:04:28Z)
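The Generalized Linear Mode Connectivity entry above reports low- and zero-barrier linear paths between independently trained transformers. The sketch below shows the standard loss-barrier measurement along a straight interpolation path; it is not that paper's method (which additionally handles the richer symmetry classes it introduces), and the model, loss function, and data loader are hypothetical placeholders.

```python
# Minimal sketch (assumed setup, not the paper's implementation): loss barrier along the
# linear path theta(t) = (1 - t) * theta_a + t * theta_b between two trained checkpoints.
# Non-parameter buffers (e.g. batch-norm statistics) are left at model_a's values for simplicity.
import copy
import torch

def loss_barrier(model_a, model_b, loss_fn, data_loader, num_points: int = 11) -> float:
    """Max excess loss along the interpolation path, relative to the endpoint average."""
    losses = []
    for t in torch.linspace(0.0, 1.0, num_points):
        interp = copy.deepcopy(model_a)
        with torch.no_grad():
            for p_i, p_a, p_b in zip(interp.parameters(),
                                     model_a.parameters(),
                                     model_b.parameters()):
                p_i.copy_((1.0 - t) * p_a + t * p_b)   # interpolate each weight tensor
        interp.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for x, y in data_loader:                   # average loss over the dataset
                total += loss_fn(interp(x), y).item() * len(y)
                count += len(y)
        losses.append(total / count)
    endpoint_avg = 0.5 * (losses[0] + losses[-1])
    return max(losses) - endpoint_avg                  # barrier height; ~0 indicates LMC
```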