Clustering in Deep Stochastic Transformers
- URL: http://arxiv.org/abs/2601.21942v1
- Date: Thu, 29 Jan 2026 16:28:13 GMT
- Title: Clustering in Deep Stochastic Transformers
- Authors: Lev Fedorov, Michaël E. Sander, Romuald Elie, Pierre Marion, Mathieu Laurière
- Abstract summary: Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point. We analyze deep Transformers where noise arises from the random initialization of value matrices. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension.
- Score: 10.988655177671255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emph{common} matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.
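The limiting dynamics described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only: the softmax interaction kernel, the noise scale `sigma`, the Euler step, and the retraction to the sphere are assumptions for demonstration, not the paper's exact construction. Each token evolves on the unit sphere under an attention-style drift plus a single matrix-valued Gaussian increment shared by all tokens (the "common noise").

```python
import numpy as np

rng = np.random.default_rng(0)

def project_tangent(x, v):
    """Project v onto the tangent space of the unit sphere at x."""
    return v - np.dot(x, v) * x

def simulate(n_tokens=2, d=3, beta=1.0, sigma=0.5, n_steps=2000, dt=1e-3):
    """Euler scheme for an interacting-particle system on the sphere,
    driven by a *common* matrix-valued Brownian increment (illustrative)."""
    X = rng.standard_normal((n_tokens, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # start on the sphere
    for _ in range(n_steps):
        # one d x d Gaussian increment shared by ALL tokens at this layer
        dW = np.sqrt(dt) * rng.standard_normal((d, d))
        newX = np.empty_like(X)
        for i in range(n_tokens):
            # softmax attention drift toward the other tokens
            w = np.exp(beta * X @ X[i])
            drift = (w[:, None] * X).sum(axis=0) / w.sum()
            v = project_tangent(X[i], beta * drift * dt + sigma * (dW @ X[i]))
            y = X[i] + v
            newX[i] = y / np.linalg.norm(y)  # retract to the unit sphere
        X = newX
    return X

X = simulate()
```

Because every token sees the same `dW`, the noise correlates their motion rather than scattering them independently; varying `beta` and `d` is one way to probe the two-token phase transition numerically.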
Related papers
- Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences [29.509249228044492]
We show that randomly initialized, untrained models display extreme token preferences across random input sequences. We show that this extreme token preference arises from a contraction of token representations along a random, seed-dependent direction.
arXiv Detail & Related papers (2026-02-05T17:37:41Z) - TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention tensor. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are universally consistent when tasked with executing Least Squares regression. We derive upper bounds on the empirical error which decay at a provable rate of $\mathcal{O}(t^{-1/2d})$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z) - Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation [8.973965016201822]
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to instability. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation and skip connections, including the propagation of gradients.
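The rank-collapse failure mode this entry describes is easy to reproduce numerically. The sketch below is a minimal, assumed setup (plain softmax attention with fresh random weights per layer, no skip connections or normalisation, and a rank-one residual as the collapse proxy), not the paper's exact model: stacking such layers drives all token representations toward a common vector.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attn_layer(X, d):
    """One self-attention layer with freshly sampled random weights
    (no skip connection, no normalisation: the failure-prone setting)."""
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # row-stochastic mixing
    return A @ (X @ Wv)

def residual_rank_proxy(X):
    """Relative distance of the token matrix from its best rank-one
    approximation; near 0 means the tokens have collapsed together."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum()) / s[0]

n_tokens, d, depth = 8, 16, 30
X = rng.standard_normal((n_tokens, d))

before = residual_rank_proxy(X)
for _ in range(depth):
    X = attn_layer(X, d)
after = residual_rank_proxy(X)  # shrinks dramatically with depth
```

Adding skip connections or layer normalisation to `attn_layer` changes the picture, which is exactly the design space the entry's signal-propagation theory analyses.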
arXiv Detail & Related papers (2025-05-30T08:18:23Z) - A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer. We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Transformer Normalisation Layers and the Independence of Semantic Subspaces [17.957364289876548]
We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, undermines this independence.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim 10\%$.
arXiv Detail & Related papers (2024-06-25T16:16:38Z) - Geometric Dynamics of Signal Propagation Predict Trainability of Transformers [22.25628914395565]
We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers. Our approach treats the evolution of $n$ tokens as they propagate through the transformer layers.
We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents.
arXiv Detail & Related papers (2024-03-05T01:30:34Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling.
Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z) - Universality of critical dynamics with finite entanglement [68.8204255655161]
We study how low-energy dynamics of quantum systems near criticality are modified by finite entanglement.
Our result establishes the precise role played by entanglement in time-dependent critical phenomena.
arXiv Detail & Related papers (2023-01-23T19:23:54Z) - Reminiscence of classical chaos in driven transmons [117.851325578242]
We show that even off-resonant drives can cause strong modifications to the structure of the transmon spectrum rendering a large part of it chaotic.
Results lead to a photon number threshold characterizing the appearance of chaos-induced quantum demolition effects.
arXiv Detail & Related papers (2022-07-19T16:04:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.