The Mean-Field Dynamics of Transformers
- URL: http://arxiv.org/abs/2512.01868v2
- Date: Tue, 09 Dec 2025 14:40:27 GMT
- Title: The Mean-Field Dynamics of Transformers
- Authors: Philippe Rigollet,
- Abstract summary: By idealizing attention on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
- Score: 6.008788032203683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long-lived metastable states in which they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
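The idealized dynamics described in the abstract (tokens as interacting particles on the unit sphere, each pulled toward a softmax-weighted average of the others) can be simulated in a few lines. The sketch below is an illustration of this class of dynamics, not the paper's exact model; the inverse temperature `beta`, step size, token count, and dimension are arbitrary choices made for the demo.

```python
import numpy as np

def normalize(x):
    """Project each row of x back onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, beta, dt, steps = 32, 3, 4.0, 0.1, 2000
x = normalize(rng.standard_normal((n, d)))   # n tokens on the sphere S^{d-1}
start_cohesion = (x @ x.T).mean()            # mean pairwise inner product

for _ in range(steps):
    logits = beta * (x @ x.T)                         # attention scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # row-wise softmax
    v = w @ x                                         # attention output
    v -= (v * x).sum(axis=1, keepdims=True) * x       # keep tangential component
    x = normalize(x + dt * v)                         # Euler step + retraction

end_cohesion = (x @ x.T).mean()  # approaches 1 as tokens collapse to one cluster
```

Running this typically shows the clustering phenomenon the abstract describes: the mean pairwise inner product drifts from near zero toward one, possibly lingering in intermediate multi-cluster configurations along the way.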
Related papers
- Krause Synchronization Transformers [63.8469912831803]
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics.
arXiv Detail & Related papers (2026-02-12T03:47:53Z) - A multiscale analysis of mean-field transformers in the moderate interaction regime [7.742297876120561]
We study the evolution of tokens through the depth of encoder-only transformer models at inference time. We provide a rigorous characterization of the limiting dynamics in each of these phases and prove convergence in the above-mentioned limit.
arXiv Detail & Related papers (2025-09-29T16:57:04Z) - Kuramoto Orientation Diffusion Models [67.0711709825854]
Orientation-rich images, such as fingerprints and textures, often exhibit coherent angular patterns. Motivated by the role of phase synchronization in biological systems, we propose a score-based generative model. We achieve competitive results on general image benchmarks and significantly improve generation quality on orientation-dense datasets such as fingerprints and textures.
arXiv Detail & Related papers (2025-09-18T18:18:49Z) - Quantitative Clustering in Mean-Field Transformer Models [32.46389492080837]
The evolution of tokens through deep transformer models can be modeled as an interacting particle system. We investigate the long-time clustering of mean-field transformer models.
arXiv Detail & Related papers (2025-04-20T18:21:34Z) - Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism. We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z) - Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression [63.56922682378755]
We focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding.
The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform.
Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
arXiv Detail & Related papers (2023-08-17T01:34:51Z) - Multi-View Clustering via Semi-non-negative Tensor Factorization [120.87318230985653]
We develop a novel multi-view clustering method based on semi-non-negative tensor factorization (Semi-NTF).
Our model directly considers the between-view relationship and exploits the between-view complementary information.
In addition, we provide an optimization algorithm for the proposed method and prove mathematically that the algorithm always converges to the stationary KKT point.
arXiv Detail & Related papers (2023-03-29T14:54:19Z) - Koopman-based spectral clustering of directed and time-evolving graphs [0.3655021726150368]
Spectral clustering algorithms for undirected graphs are well established and have been successfully applied to unsupervised machine learning problems.
However, clustering directed graphs remains notoriously difficult and there is no universally accepted definition of clusters in directed graphs.
We derive clustering algorithms for directed and time-evolving graphs using relationships between Laplacians and transfer operators.
The resulting clusters can be interpreted as coherent sets, which play an important role in the analysis of transport and mixing processes in fluid flows.
arXiv Detail & Related papers (2022-04-06T17:33:24Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings [54.33327082243022]
ClusterVO is a stereo visual odometry system that simultaneously clusters and estimates the motion of both the ego camera and surrounding rigid clusters/objects.
Unlike previous solutions that rely on batch input or impose priors on scene structure or dynamic object models, ClusterVO is online and general, and can thus be used in various scenarios, including indoor scene understanding and autonomous driving.
arXiv Detail & Related papers (2020-03-29T09:06:28Z) - Self-Supervised Learning of Generative Spin-Glasses with Normalizing Flows [0.0]
We develop continuous spin-glass distributions with normalizing flows to model correlations in generic discrete problems.
We demonstrate that key physical and computational properties of the spin-glass phase can be successfully learned.
Remarkably, we observe that the learning itself corresponds to a spin-glass phase transition within the layers of the trained normalizing flows.
arXiv Detail & Related papers (2020-01-02T19:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.