Krause Synchronization Transformers
- URL: http://arxiv.org/abs/2602.11534v1
- Date: Thu, 12 Feb 2026 03:47:53 GMT
- Title: Krause Synchronization Transformers
- Authors: Jingkun Liu, Yisong Yue, Max Welling, Yue Song
- Abstract summary: Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics.
- Score: 63.8469912831803
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
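The "bounded-confidence consensus dynamics" the abstract invokes are the Hegselmann-Krause model: agent $i$ updates its state by averaging only the agents within a confidence radius $\varepsilon$,

$$x_i(t+1) = \frac{1}{|N_i(t)|} \sum_{j \in N_i(t)} x_j(t), \qquad N_i(t) = \{\, j : \|x_i(t) - x_j(t)\| \le \varepsilon \,\}.$$

The NumPy sketch below applies this update to token states in place of softmax mixing. It is a minimal illustration, not the paper's method: the hard threshold, the uniform neighborhood weights, and the `eps` radius are assumptions, since the abstract does not specify how Krause Attention parameterizes distances or sparsity.

```python
import numpy as np

def krause_style_attention(x, eps=2.0):
    """One bounded-confidence (Hegselmann-Krause) update over token states.

    Each token averages only the tokens within Euclidean distance `eps`
    of itself, instead of softmax-mixing all tokens. `eps` is a
    hypothetical confidence radius, not a value from the paper.

    x: (n_tokens, d) array of token states.
    Returns: (n_tokens, d) array of locally averaged states.
    """
    # Pairwise Euclidean distances between token states, shape (n, n).
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

    # Bounded-confidence mask: token j influences token i only if it
    # lies inside i's confidence ball (the diagonal is always True).
    mask = dist <= eps

    # Uniform weights over each neighborhood; rows sum to 1.
    weights = mask / mask.sum(axis=1, keepdims=True)
    return weights @ x

# Iterating the update drives tokens toward several local clusters
# rather than one global consensus, mirroring the "structured local
# synchronization" described in the abstract.
tokens = np.random.default_rng(0).normal(size=(16, 8))
for _ in range(10):
    tokens = krause_style_attention(tokens, eps=2.0)
```

Note that this dense sketch still computes all pairwise distances, so it remains quadratic; the linear runtime claimed in the abstract presumably comes from restricting candidate neighbors to a local window before applying the confidence threshold.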
Related papers
- Investigation of quantum chaos in local and non-local Ising models [0.0]
We investigate quantum chaos within Ising spin chains subjected to transverse and longitudinal fields. We show that systems with non-local interactions display a stronger propensity toward chaos, even when the non-local couplings are weak. Our findings underscore the role of non-local interactions in accelerating the onset of chaos and modifying dynamical complexity in quantum spin chains.
arXiv Detail & Related papers (2025-12-25T15:25:01Z) - The Mean-Field Dynamics of Transformers [6.008788032203683]
By idealizing attention on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, Kuramoto-type synchronization, and mean-shift clustering (the standard Kuramoto system is recalled after this list). Results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
arXiv Detail & Related papers (2025-12-01T16:51:00Z) - Kuramoto Orientation Diffusion Models [67.0711709825854]
Orientation-rich images, such as fingerprints and textures, often exhibit coherent angular patterns. Motivated by the role of phase synchronization in biological systems, we propose a score-based generative model. We achieve competitive results on general image benchmarks and significantly improve generation quality on orientation-dense datasets like fingerprints and textures.
arXiv Detail & Related papers (2025-09-18T18:18:49Z) - Dynamic Relational Priming Improves Transformer in Multivariate Time Series [0.0]
We propose attention with dynamic relational priming (prime attention). We show that prime attention consistently outperforms standard attention across benchmarks. We also find that prime attention achieves comparable or superior performance using up to 40% less sequence length compared to standard attention.
arXiv Detail & Related papers (2025-09-15T17:56:15Z) - Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions [94.21989689001848]
We propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
arXiv Detail & Related papers (2025-04-30T03:57:28Z) - Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches. We propose an Adaptive Re-Activation Mechanism (AReAM) that calibrates undisciplined over-smoothing in deep-level attention. AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks (a minimal sketch of the reparameterization appears after this list).
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Relational Self-Attention: What's Missing in Attention for Video Understanding [52.38780998425556]
We introduce a relational feature transform, dubbed relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
arXiv Detail & Related papers (2021-11-02T15:36:11Z) - Critically slow operator dynamics in constrained many-body systems [0.0]
We show that in certain constrained many-body systems the structure of conservation laws can cause a drastic modification of this universal behavior.
We identify a critical point with a sub-ballistically moving OTOC front that separates a ballistic phase from a dynamically frozen one.
arXiv Detail & Related papers (2021-06-09T18:00:04Z)
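For reference, the Kuramoto system mentioned in the "Mean-Field Dynamics of Transformers" entry above couples $N$ phase oscillators through their pairwise phase differences:

$$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i),$$

where $\omega_i$ is oscillator $i$'s natural frequency and $K$ the coupling strength. This is the textbook form of the model; how that paper idealizes spherical attention into such dynamics is not detailed in the summary above.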
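As noted above, here is a minimal sketch of the σReparam idea from "Stabilizing Transformer Training by Preventing Attention Entropy Collapse". To my understanding, the method reparameterizes each weight matrix as $\hat{W} = (\gamma/\sigma(W))\,W$, where $\sigma(W)$ is the spectral norm and $\gamma$ is a learnable scalar; details such as the initialization of $\gamma$ are omitted here, and the exact formulation should be checked against the paper.

```python
import numpy as np

def spectral_norm(w, n_iter=30, seed=0):
    """Estimate the largest singular value of w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = w @ v
        u /= np.linalg.norm(u)
        v = w.T @ u
        v /= np.linalg.norm(v)
    return float(u @ w @ v)  # Rayleigh quotient at convergence

def sigma_reparam(w, gamma=1.0):
    """Rescale w so its spectral norm equals gamma.

    In the paper gamma is a learnable parameter; here it is a plain
    argument for illustration.
    """
    return (gamma / spectral_norm(w)) * w
```

Pinning each weight matrix's spectral norm to a trainable value moderates the growth of attention logits, which, per that paper's analysis, is what averts entropy collapse.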