Geometric and Dynamic Scaling in Deep Transformers
- URL: http://arxiv.org/abs/2601.01014v2
- Date: Tue, 06 Jan 2026 01:35:54 GMT
- Title: Geometric and Dynamic Scaling in Deep Transformers
- Authors: Haoran Su, Chenyu You
- Abstract summary: We argue that the collapse of deep Transformers is fundamentally a geometric problem. We propose a unified geometric framework that addresses these failures through two principles. Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks.
- Score: 13.697614668609205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric framework that addresses these failures through two orthogonal principles. First, manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing uncontrolled manifold drift. Second, deep delta learning introduces data-dependent, non-monotonic updates that enable reflection and erasure of redundant features rather than their unconditional accumulation. Together, these mechanisms decouple the direction and sign of feature updates, yielding a stable geometric evolution across depth. We term the resulting architecture the Manifold-Geometric Transformer (MGT). Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks. We outline an evaluation protocol for Transformers exceeding 100 layers to test the hypothesis that geometry, rather than depth itself, is the key limiting factor in deep representation learning.
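The abstract describes the two mechanisms only at the level of principles, so the following is a minimal sketch under assumptions, not the authors' implementation: the block name, the learned low-rank tangent basis, the Householder-style gate, and all shapes are choices made for illustration. Principle 1 projects the sublayer update onto an estimated local tangent subspace; principle 2 applies a data-dependent gate in (0, 2) to the carried state, so features can be kept, partially erased, or reflected rather than only accumulated.

```python
# Hypothetical sketch of the two principles in the abstract; every name, shape,
# and design choice below is an assumption made for illustration only.
import torch
import torch.nn as nn


class GeometricDeltaBlock(nn.Module):
    def __init__(self, d_model: int, tangent_rank: int = 16):
        super().__init__()
        # Stand-in for the usual attention/MLP sublayer.
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Learned directions whose span approximates the local tangent space.
        self.tangent = nn.Parameter(torch.randn(d_model, tangent_rank) / d_model ** 0.5)
        # Direction along which the carried state may be projected out or reflected.
        self.k = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        # Gate head producing beta(x) in (0, 2): 0 ~ identity, 1 ~ projection, 2 ~ reflection.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        # Principle 2: non-monotonic, data-dependent transition of the carried state.
        k = self.k / self.k.norm()
        beta = 2.0 * torch.sigmoid(self.gate(x))           # (batch, seq, 1)
        carried = x - beta * (x @ k).unsqueeze(-1) * k      # x - beta (x . k) k

        # Principle 1: constrain the new update to the learned tangent subspace.
        q, _ = torch.linalg.qr(self.tangent)                # orthonormal basis (d_model, rank)
        update = self.f(x)
        update = (update @ q) @ q.T

        return carried + update


# Usage: a toy stack of such blocks.
blocks = nn.Sequential(*[GeometricDeltaBlock(64) for _ in range(8)])
h = blocks(torch.randn(2, 10, 64))
print(h.shape)  # torch.Size([2, 10, 64])
```

The design point mirrored from the abstract is the decoupling: the tangent projection constrains the direction of the update, while the gate controls its sign and magnitude, allowing non-monotonic feature evolution across depth.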
Related papers
- The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization [57.37943479039033]
We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent. We show that locality and weight sharing fundamentally change this picture.
arXiv Detail & Related papers (2026-03-05T04:50:51Z) - Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits [72.0643009153473]
We show that expressive variational ansätze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class.
arXiv Detail & Related papers (2026-01-05T08:04:33Z) - Deep Delta Learning [91.75868893250662]
We introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection. We provide a spectral analysis of this operator, demonstrating that the data-dependent gate enables dynamic interpolation between identity mapping, projection, and geometric reflection. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics.
arXiv Detail & Related papers (2026-01-01T18:11:38Z) - Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics [9.885471525709113]
We show that scaling laws describe what success looks like but not when and why scaling succeeds or fails. A central obstacle is the lack of a rigorous understanding of feature learning at large depth.
arXiv Detail & Related papers (2025-12-24T09:39:04Z) - Confidence is Not Competence [7.094715131203088]
We analyze the geometry of internal states across two phases - pre-generative assessment and solution execution. A sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap.
arXiv Detail & Related papers (2025-10-24T17:22:48Z) - Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility [90.894232610821]
We analyze Transformers through the lens of rank structure. We show that time-series embeddings exhibit sharply decaying singular value spectra. We prove that the associated $Q/K/V$ projections admit accurate low-rank approximations.
arXiv Detail & Related papers (2025-10-02T23:56:17Z) - Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It [5.89889361990138]
We argue that existing sharpness measures fail for transformers because transformers have much richer symmetries in their attention mechanism. We propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. We show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
arXiv Detail & Related papers (2025-05-08T16:51:03Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Curve Your Attention: Mixed-Curvature Transformers for Graph
Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces.
We also provide a kernelized approach to non-Euclidean attention, which enables our model to run in time and memory cost linear to the number of nodes and edges.
arXiv Detail & Related papers (2023-09-08T02:44:37Z) - Signal Propagation in Transformers: Theoretical Perspectives and the
Role of Rank Collapse [11.486545294602697]
We shed new light on the causes and effects of rank collapse in Transformers.
We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish (a minimal diagnostic sketch of this kind of collapse follows this list).
arXiv Detail & Related papers (2022-06-07T09:07:24Z)
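Several of the entries above, as well as the over-100-layer evaluation protocol sketched in the main abstract, revolve around measuring rank collapse across depth. The snippet below is a minimal, illustrative diagnostic under assumed choices, not any paper's published protocol: it uses an entropy-based effective-rank metric and a toy PyTorch encoder stack to track how the rank of token representations evolves layer by layer.

```python
# Illustrative diagnostic only: the effective-rank metric, the toy encoder stack,
# and all hyperparameters below are assumptions, not a published evaluation protocol.
import torch
import torch.nn as nn


def effective_rank(h: torch.Tensor) -> float:
    """Entropy-based effective rank of a (seq, d_model) representation matrix."""
    s = torch.linalg.svdvals(h)
    p = s / s.sum()
    p = p[p > 0]
    return torch.exp(-(p * p.log()).sum()).item()


# Toy encoder stack; a per-layer hook on a real Transformer works the same way.
d_model, depth = 64, 24
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(depth)]
)

h = torch.randn(1, 32, d_model)  # (batch, seq, d_model)
with torch.no_grad():
    for i, layer in enumerate(layers):
        h = layer(h)
        print(f"layer {i:2d}  effective rank = {effective_rank(h[0]):.2f}")
```

On a real model, the same effective_rank call can be applied to hidden states captured with forward hooks; a steady decay toward rank one with depth is the representational degeneracy that the geometric update rules above are intended to prevent.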