A Residual-Aware Theory of Position Bias in Transformers
- URL: http://arxiv.org/abs/2602.16837v1
- Date: Wed, 18 Feb 2026 20:01:39 GMT
- Title: A Residual-Aware Theory of Position Bias in Transformers
- Authors: Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, Sören Laue
- Abstract summary: We show that Transformer models systematically favor certain token positions. We prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
- Score: 2.9332247106953098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout: incorporating residual connections into the rollout analysis shows that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
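The rollout mechanism in the abstract can be sketched numerically. The toy simulation below is a minimal illustration under assumed conventions, not the authors' code: it uses the common rollout form (1 - alpha) * A + alpha * I for each layer's effective map, random causal softmax attention matrices in place of trained ones, and arbitrary choices of sequence length, depth, and mixing weight `alpha`.

```python
# Minimal sketch (illustrative, not the authors' code): cumulative attention
# rollout over random causal attention maps, with and without the residual
# term. Assumes the common rollout convention A_eff = (1 - alpha) * A + alpha * I.
import numpy as np

rng = np.random.default_rng(0)
n = 16  # sequence length (illustrative)

def causal_attention(rng, n):
    """Row-stochastic causal attention map: softmax over positions <= query."""
    logits = rng.normal(size=(n, n))
    logits[~np.tril(np.ones((n, n), dtype=bool))] = -np.inf
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

def rollout(layers, alpha):
    """Cumulative rollout: product over layers of (1 - alpha) * A_l + alpha * I."""
    r = np.eye(layers[0].shape[0])
    for a in layers:
        r = ((1 - alpha) * a + alpha * np.eye(len(a))) @ r
    return r

layers = [causal_attention(rng, n) for _ in range(12)]

# No residual term, large depth: the last query's rollout mass drifts
# onto the first token -- the collapse predicted by prior analyses.
print(rollout(layers, alpha=0.0)[-1].round(3))

# With the residual term at finite depth: mass typically stays largest at the
# earliest positions and at the query's own (final) position -- a U-shaped profile.
print(rollout(layers[:4], alpha=0.5)[-1].round(3))
```

Without the residual term, the deep rollout row piles its mass onto the first position, matching the collapse predicted at infinite depth; with it, the finite-depth rollout row concentrates at the earliest positions and stays elevated at the final (query) position, the U-shaped profile behind Lost-in-the-Middle.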
Related papers
- The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization [57.37943479039033]
We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent. We show that locality and weight sharing fundamentally change this picture.
arXiv Detail & Related papers (2026-03-05T04:50:51Z) - Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits [72.0643009153473]
We show that expressive variational ansätze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class.
arXiv Detail & Related papers (2026-01-05T08:04:33Z) - Geometric and Dynamic Scaling in Deep Transformers [13.697614668609205]
We argue that the collapse of deep Transformers is fundamentally a geometric problem. We propose a unified geometric framework that addresses these failures through two principles. Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks.
arXiv Detail & Related papers (2026-01-03T00:41:46Z) - Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility [90.894232610821]
We analyze Transformers through the lens of rank structure. We show that time-series embeddings exhibit sharply decaying singular value spectra. We prove that the associated $Q/K/V$ projections admit accurate low-rank approximations.
arXiv Detail & Related papers (2025-10-02T23:56:17Z) - Time Symmetry, Retrocausality, and Emergent Collapse: The Tlalpan Interpretation of Quantum Mechanics [51.56484100374058]
The Tlalpan Interpretation (QTI) proposes that the wavefunction collapse is not a primitive, axiomatic rule but an emergent phenomenon. The novelty of QTI lies in its embedding of collapse within the conceptual language of critical phenomena in statistical physics.
arXiv Detail & Related papers (2025-08-25T20:30:56Z) - On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Our framework offers a principled foundation for understanding positional interplay in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z) - Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond [17.002793355495136]
We propose the first theoretical explanation of the inefficiency of transformers on time-series forecasting (TSF) tasks. We attribute the mechanism behind it to Asymmetric Learning in training attention networks.
arXiv Detail & Related papers (2024-12-08T20:29:06Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - Transformer Normalisation Layers and the Independence of Semantic Subspaces [17.957364289876548]
We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, undermines this ability.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim 10\%$.
arXiv Detail & Related papers (2024-06-25T16:16:38Z) - Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse [11.486545294602697]
We shed new light on the causes and effects of rank collapse in Transformers.
We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish; a toy illustration of this collapse appears after this list.
arXiv Detail & Related papers (2022-06-07T09:07:24Z)
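The rank-collapse result in the last entry above is easy to reproduce in a toy setting. The sketch below is an illustration under assumed conditions (random weights, single-head softmax self-attention, no residual connections or normalisation), not the paper's construction; it tracks representation collapse only, not the vanishing query/key gradients the paper analyses.

```python
# Minimal sketch (illustrative, not the paper's construction): pure softmax
# self-attention without residual connections or normalisation drives the
# token representations toward rank one across layers.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 64  # tokens, model width (illustrative)
x = rng.normal(size=(n, d))

def self_attention(x, rng):
    """One attention layer with random weights; no skip connection."""
    d = x.shape[1]
    wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    logits = (x @ wq) @ (x @ wk).T / np.sqrt(d)
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ (x @ wv)

def rank1_residual(x):
    """Fraction of spectral mass outside the best rank-1 approximation."""
    s = np.linalg.svd(x, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

for layer in range(8):
    print(f"layer {layer}: rank-1 residual = {rank1_residual(x):.4f}")
    x = self_attention(x, rng)
```

The printed ratio shrinks rapidly with depth, since each row-stochastic attention map averages token representations toward a common point; residual connections, the component central to the main paper above, are exactly what interrupts this contraction.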
This list is automatically generated from the titles and abstracts of the papers on this site.