Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences
- URL: http://arxiv.org/abs/2602.05927v1
- Date: Thu, 05 Feb 2026 17:37:41 GMT
- Title: Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences
- Authors: Siquan Li, Yao Tong, Haonan Wang, Tianyang Hu,
- Abstract summary: We show that untrained, randomly initialized models display extreme token preferences across random input sequences. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction.
- Score: 29.509249228044492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers underpin modern large language models (LLMs) and are commonly assumed to be behaviorally unstructured at random initialization, with all meaningful preferences emerging only through large-scale training. We challenge this assumption by showing that randomly initialized transformers already exhibit strong and systematic structural biases. In particular, untrained models display extreme token preferences: across random input sequences, certain tokens are predicted with probabilities orders of magnitude larger. We provide a mechanistic explanation for this phenomenon by dissecting the transformer architecture at initialization. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction. This contraction is driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers induce global (inter-sequence) representation concentration, and (ii) self-attention further amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms align hidden representations along a direction determined solely by the random initialization, producing highly non-uniform next-token predictions. Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist throughout training, forming a stable and intrinsic model identity. Leveraging this property, we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that differ only in their random initialization, even after extensive training and under substantial distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the attention mechanism's intra-sequence contraction that is causally linked to the attention-sink phenomenon. This discovery provides a principled explanation for the emergence of sinks and offers a pathway for their control.
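The abstract attributes part of the contraction to asymmetric nonlinear activations in MLP sublayers. The following toy sketch is not the paper's experiment: it assumes a single untrained ReLU MLP with Gaussian weights, Gaussian inputs, no residual connection, and no normalization, and all function names and dimensions are hypothetical. It only illustrates the general point that a one-sided activation gives outputs a shared, seed-dependent mean direction, so unrelated inputs end up with positive pairwise cosine similarity.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Gaussian matrix with variance 1/cols (an assumed, common init scaling)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def relu_mlp(x, w_in, w_out):
    """One untrained MLP sublayer: W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def make_inputs(n=30, d_model=48, seed=0):
    """Random Gaussian stand-ins for token representations."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n)]


def run(seed, inputs, d_model=48, d_hidden=96):
    """Return (mean pairwise cosine, mean output direction) for one weight seed."""
    rng = random.Random(seed)
    w_in = random_matrix(d_hidden, d_model, rng)
    w_out = random_matrix(d_model, d_hidden, rng)
    outputs = [relu_mlp(x, w_in, w_out) for x in inputs]
    return mean_pairwise_cosine(outputs), mean_vector(outputs)


if __name__ == "__main__":
    inputs = make_inputs()
    print("mean cosine of inputs:", mean_pairwise_cosine(inputs))
    for seed in (1, 2):
        after, _ = run(seed, inputs)
        print(f"mean cosine after MLP (seed {seed}):", after)
    _, dir_a = run(1, inputs)
    _, dir_b = run(2, inputs)
    print("cosine between the two seeds' mean directions:", cosine(dir_a, dir_b))
```

Under these assumptions the inputs are close to mutually orthogonal on average, while the MLP outputs share a clearly positive mean cosine (in the wide-layer limit with Gaussian inputs the expected value is about 1/π). The shared direction is set by the weights, so two seeds generally produce unrelated directions. Attention, residual connections, and the unembedding step discussed in the abstract are deliberately left out.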
Related papers
- Diffusion-Inspired Reconfiguration of Transformers for Uncertainty Calibration [52.017716672255524]
Uncertainty calibration in pre-trained transformers is critical for their reliable deployment in risk-sensitive applications. We propose a diffusion-inspired reconfiguration of transformers in which each feature transformation block is modeled as a probabilistic mapping. Our method achieves superior calibration and predictive accuracy compared to existing uncertainty-aware transformers.
arXiv Detail & Related papers (2026-02-09T17:24:47Z)
- Clustering in Deep Stochastic Transformers [10.988655177671255]
Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point. We analyze deep Transformers where noise arises from the random value of value. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension.
arXiv Detail & Related papers (2026-01-29T16:28:13Z)
- TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction connection. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z)
- Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits [72.0643009153473]
We show that expressive variational ansatze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class.
arXiv Detail & Related papers (2026-01-05T08:04:33Z)
- From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers [67.02076505996284]
We study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.
arXiv Detail & Related papers (2025-12-21T08:10:26Z)
- Emergence of Quantised Representations Isolated to Anisotropic Functions [0.0]
This paper presents a novel methodology for determining representational structure, which builds upon the existing Spotlight Resonance method. It shows how discrete representations can emerge and organise in autoencoder models, through a controlled ablation study in which only the activation function is altered. The technique is then used to test whether function-driven symmetries can act as implicit inductive biases on representations.
arXiv Detail & Related papers (2025-07-16T09:27:54Z)
- Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Least Squares regression. We derive upper bounds on the empirical error which, in the regime, decay at a provable rate of $\mathcal{O}(t^{-1/2d})$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z)
- A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer. We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z)
- Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components.
Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z)
- An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models [64.87562101662952]
We show that input tokens are often exchangeable since they already include positional encodings.
We establish the existence of a sufficient and minimal representation of input tokens.
We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
arXiv Detail & Related papers (2022-12-30T17:59:01Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)