The emergence of clusters in self-attention dynamics
- URL: http://arxiv.org/abs/2305.05465v6
- Date: Mon, 12 Feb 2024 10:21:08 GMT
- Title: The emergence of clusters in self-attention dynamics
- Authors: Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet
- Abstract summary: We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity.
Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix.
- Score: 24.786862288360076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Viewing Transformers as interacting particle systems, we describe the
geometry of learned representations when the weights are not time dependent. We
show that particles, representing tokens, tend to cluster toward particular
limiting objects as time tends to infinity. Cluster locations are determined by
the initial tokens, confirming context-awareness of representations learned by
Transformers. Using techniques from dynamical systems and partial differential
equations, we show that the type of limiting object that emerges depends on the
spectrum of the value matrix. Additionally, in the one-dimensional case we
prove that the self-attention matrix converges to a low-rank Boolean matrix.
The combination of these results mathematically confirms the empirical
observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence
of tokens when processed by Transformers.
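The one-dimensional Boolean-attention claim is easy to observe numerically. Below is a minimal sketch of the stated dynamics, assuming identity query/key/value weights and a forward-Euler discretization; it is an illustration of the limiting behavior, not the paper's exact setup:

```python
import numpy as np

def attention_step(x, dt=0.1):
    # One forward-Euler step of the 1D self-attention ODE
    #   dx_i/dt = sum_j softmax_j(x_i * x_j) * x_j
    # with identity query/key/value weights (illustrative choice).
    logits = np.outer(x, x)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # row-stochastic attention
    return x + dt * (A @ x), A

x = np.array([-1.0, 0.5, 2.0])  # three tokens on the line
for _ in range(300):
    x, A = attention_step(x)

print(np.round(A, 3))
```

With these initial tokens, the positive tokens follow the leader started at x = 2 while the negative token attends to itself; as the token magnitudes grow, each softmax row saturates toward a one-hot vector, so the attention matrix approaches a rank-2 Boolean matrix, consistent with the 1D result above.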
Related papers
- Transformers learn factored representations [62.86679034549244]
Transformers pretrained via next token prediction learn to factor their world into parts. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts. We derive precise predictions about the geometric structure of activations for each, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low-dimensional structure may persist even in models trained on complex data.
arXiv Detail & Related papers (2026-02-02T17:49:06Z) - Clustering in Deep Stochastic Transformers [10.988655177671255]
Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point. We analyze deep Transformers where noise arises from random value matrices. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension.
arXiv Detail & Related papers (2026-01-29T16:28:13Z) - Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z) - Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling [33.595964789473065]
This work demystifies the Transformer encoder by establishing its fundamental equivalence to a Graph Convolutional Network (GCN). We propose Fighter (Flexible Graph Convolutional Transformer), a streamlined architecture that removes redundant linear projections and incorporates multi-hop graph aggregation.
arXiv Detail & Related papers (2025-10-20T02:42:14Z) - A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer.
We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z) - Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
arXiv Detail & Related papers (2024-06-26T16:13:35Z) - Towards Understanding Inductive Bias in Transformers: A View From Infinity [9.00214539845063]
We argue transformers tend to be biased towards more permutation symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
arXiv Detail & Related papers (2024-02-07T19:00:01Z) - Matrix product state fixed points of non-Hermitian transfer matrices [11.686585954351436]
We investigate the impact of gauge degrees of freedom in the virtual indices of the tensor network on the contraction process.
We show that the gauge transformation can affect the entanglement structures of the eigenstates of the transfer matrix.
arXiv Detail & Related papers (2023-11-30T17:28:30Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Transformers are efficient hierarchical chemical graph learners [7.074125287195362]
SubFormer is a graph transformer that operates on subgraphs that aggregate information by a message-passing mechanism.
We show that SubFormer exhibits limited over-smoothing and avoids over-squashing, which is prevalent in traditional graph neural networks.
arXiv Detail & Related papers (2023-10-02T23:57:04Z) - Mapping of attention mechanisms to a generalized Potts model [50.91742043564049]
We show that training a neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method.
We also compute the generalization error of self-attention in a model scenario analytically using the replica method.
arXiv Detail & Related papers (2023-04-14T16:32:56Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z) - Sinkformers: Transformers with Doubly Stochastic Attention [22.32840998053339]
We use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer.
On the experimental side, we show Sinkformers enhance model accuracy in vision and natural language processing tasks.
Importantly, on 3D shapes classification, Sinkformers lead to a significant improvement.
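The doubly stochastic normalization at the heart of Sinkformers can be sketched in a few lines. This is a generic Sinkhorn iteration, not the authors' implementation; the function name and iteration count are illustrative choices:

```python
import numpy as np

def sinkhorn_attention(logits, n_iters=30):
    # Alternately normalize rows and columns of exp(logits);
    # by Sinkhorn's theorem, for a positive matrix this converges
    # to a doubly stochastic matrix (all rows and columns sum to 1).
    K = np.exp(logits - logits.max())  # positive, numerically stable
    for _ in range(n_iters):
        K /= K.sum(axis=1, keepdims=True)  # rows sum to 1
        K /= K.sum(axis=0, keepdims=True)  # columns sum to 1
    return K

rng = np.random.default_rng(0)
A = sinkhorn_attention(rng.normal(size=(6, 6)))
```

Truncating the loop at one row normalization recovers ordinary softmax attention, so this family interpolates between standard attention and its doubly stochastic counterpart.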
arXiv Detail & Related papers (2021-10-22T13:25:01Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Graph Gamma Process Generalized Linear Dynamical Systems [60.467040479276704]
We introduce graph gamma process (GGP) linear dynamical systems to model real multivariate time series.
For temporal pattern discovery, the latent representation under the model is used to decompose the time series into a parsimonious set of multivariate sub-sequences.
We use the generated random graph, whose number of nonzero-degree nodes is finite, to define both the sparsity pattern and dimension of the latent state transition matrix.
arXiv Detail & Related papers (2020-07-25T04:16:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.