The emergence of clusters in self-attention dynamics
- URL: http://arxiv.org/abs/2305.05465v6
- Date: Mon, 12 Feb 2024 10:21:08 GMT
- Title: The emergence of clusters in self-attention dynamics
- Authors: Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet
- Abstract summary: We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity.
Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix.
- Score: 24.786862288360076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Viewing Transformers as interacting particle systems, we describe the
geometry of learned representations when the weights are not time dependent. We
show that particles, representing tokens, tend to cluster toward particular
limiting objects as time tends to infinity. Cluster locations are determined by
the initial tokens, confirming context-awareness of representations learned by
Transformers. Using techniques from dynamical systems and partial differential
equations, we show that the type of limiting object that emerges depends on the
spectrum of the value matrix. Additionally, in the one-dimensional case we
prove that the self-attention matrix converges to a low-rank Boolean matrix.
The combination of these results mathematically confirms the empirical
observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence
of tokens when processed by Transformers.
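To make the particle-system picture concrete, here is a minimal numerical sketch (an illustration under simplifying assumptions, not the authors' code): tokens are integrated by forward-Euler steps of the attention dynamics with identity query, key, and value matrices; the normalized tokens tend to coalesce onto a few directions, and the rows of the self-attention matrix become close to one-hot, in line with the near-Boolean limit described above. The choice of Q, K, V, the step size, and the horizon are illustrative.

```python
import numpy as np

def attention_flow(X, Q, K, V, dt=0.1, steps=300):
    """Forward-Euler integration of dx_i/dt = sum_j softmax_j(<Q x_i, K x_j>) V x_j."""
    for _ in range(steps):
        scores = (X @ Q.T) @ (X @ K.T).T              # scores[i, j] = <Q x_i, K x_j>
        scores -= scores.max(axis=1, keepdims=True)   # stabilize the row-wise softmax
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)             # row-stochastic self-attention matrix
        X = X + dt * (A @ X) @ V.T                    # each token drifts toward its attention average
    return X, A

rng = np.random.default_rng(0)
X0 = rng.standard_normal((8, 2))                      # 8 tokens in 2 dimensions
I = np.eye(2)
X_T, A_T = attention_flow(X0, Q=I, K=I, V=I)
dirs = X_T / np.linalg.norm(X_T, axis=1, keepdims=True)
print(np.round(dirs, 3))   # normalized tokens typically concentrate on a few shared directions
print(np.round(A_T, 2))    # attention rows are close to one-hot, i.e. nearly Boolean
```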
Related papers
- Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
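As a rough illustration of this clustering (identity query/key matrices, a fixed step size, and no normalization sublayer are assumptions, so this is a sketch rather than the paper's model), replacing the softmax by an argmax and repeatedly moving each token toward its best-scoring token makes the iterates collapse onto a few leader points:

```python
import numpy as np

def hardmax_step(X, alpha=0.5):
    scores = X @ X.T                          # <x_i, x_j> with identity query/key matrices
    targets = X[np.argmax(scores, axis=1)]    # hardmax: each token selects its best-scoring token
    return (1 - alpha) * X + alpha * targets  # convex step toward the selected token

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 2))              # 10 tokens in 2 dimensions
for _ in range(50):
    X = hardmax_step(X)
print(np.round(X, 3))   # many rows coincide: tokens collapse onto a few leader points
```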
arXiv Detail & Related papers (2024-06-26T16:13:35Z) - Towards Understanding Inductive Bias in Transformers: A View From Infinity [9.00214539845063]
We argue transformers tend to be biased towards more permutation symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
arXiv Detail & Related papers (2024-02-07T19:00:01Z) - Matrix product state fixed points of non-Hermitian transfer matrices [11.686585954351436]
We investigate the impact of gauge degrees of freedom in the virtual indices of the tensor network on the contraction process.
We show that the gauge transformation can affect the entanglement structures of the eigenstates of the transfer matrix.
arXiv Detail & Related papers (2023-11-30T17:28:30Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
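A minimal sketch of the inversion (illustrative shapes and layer sizes, not the official iTransformer code): each variate's whole time series is embedded as one token, so self-attention mixes variates rather than time steps.

```python
import torch
import torch.nn as nn

class InvertedAttentionBlock(nn.Module):
    """One 'inverted' block: tokens are variates, not time steps."""
    def __init__(self, seq_len: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)   # embed each variate's full series as one token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):                                 # x: (batch, time, variates)
        tokens = self.embed(x.transpose(1, 2))            # (batch, variates, d_model)
        attn_out, _ = self.attn(tokens, tokens, tokens)   # attention across variates
        h = tokens + attn_out                             # residual around attention
        return h + self.ff(h)                             # residual around feed-forward

block = InvertedAttentionBlock(seq_len=96)
out = block(torch.randn(8, 96, 7))                  # batch of 8, 96 time steps, 7 variates
print(out.shape)                                    # torch.Size([8, 7, 64])
```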
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Transformers are efficient hierarchical chemical graph learners [7.074125287195362]
SubFormer is a graph transformer that operates on subgraphs that aggregate information by a message-passing mechanism.
We show that SubFormer exhibits limited over-smoothing and avoids over-squashing, which is prevalent in traditional graph neural networks.
arXiv Detail & Related papers (2023-10-02T23:57:04Z) - Mapping of attention mechanisms to a generalized Potts model [50.91742043564049]
We show that training a neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method.
We also compute the generalization error of self-attention in a model scenario analytically using the replica method.
arXiv Detail & Related papers (2023-04-14T16:32:56Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z) - Sinkformers: Transformers with Doubly Stochastic Attention [22.32840998053339]
We use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer.
On the experimental side, we show Sinkformers enhance model accuracy in vision and natural language processing tasks.
Importantly, on 3D shapes classification, Sinkformers lead to a significant improvement.
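To illustrate the doubly stochastic construction, here is a small sketch (illustrative, not the Sinkformer reference implementation): Sinkhorn's algorithm alternates row and column normalizations of the exponentiated score matrix; the fixed iteration count and the max-subtraction stabilization are assumptions.

```python
import numpy as np

def sinkhorn_attention(scores, n_iter=20):
    """Turn raw attention logits into a near doubly stochastic matrix."""
    K = np.exp(scores - scores.max())          # positive kernel, numerically stabilized
    for _ in range(n_iter):
        K /= K.sum(axis=1, keepdims=True)      # rows sum to 1 (the usual softmax step)
        K /= K.sum(axis=0, keepdims=True)      # columns sum to 1
    return K

rng = np.random.default_rng(0)
A = sinkhorn_attention(rng.standard_normal((5, 5)))
print(np.round(A.sum(axis=1), 3))   # row sums close to 1
print(np.round(A.sum(axis=0), 3))   # column sums close to 1
```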
arXiv Detail & Related papers (2021-10-22T13:25:01Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Graph Gamma Process Generalized Linear Dynamical Systems [60.467040479276704]
We introduce graph gamma process (GGP) linear dynamical systems to model real multivariate time series.
For temporal pattern discovery, the latent representation under the model is used to decompose the time series into a parsimonious set of multivariate sub-sequences.
We use the generated random graph, whose number of nonzero-degree nodes is finite, to define both the sparsity pattern and dimension of the latent state transition matrix.
arXiv Detail & Related papers (2020-07-25T04:16:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.