What are you sinking? A geometric approach on attention sink
- URL: http://arxiv.org/abs/2508.02546v1
- Date: Mon, 04 Aug 2025 15:59:15 GMT
- Title: What are you sinking? A geometric approach on attention sink
- Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri
- Abstract summary: Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact but the manifestation of a fundamental geometric principle.
- Score: 6.552700667389349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact but the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference-frame types (centralized, distributed, and bidirectional) that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show how architectural components, particularly position-encoding implementations, influence which type of reference frame emerges. This perspective transforms our understanding of transformer attention mechanisms and provides insights both for architecture design and for its relationship with AS.
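The abstract characterizes AS behaviorally: certain key positions absorb a disproportionate share of attention mass. That behavioral definition suggests a simple diagnostic. The following PyTorch sketch is ours, not the paper's: it averages attention mass per key position and compares it to the uniform 1/T baseline, with an artificial logit bias standing in for a real model's sink token.

```python
import torch

def sink_scores(attn: torch.Tensor) -> torch.Tensor:
    """Average attention mass received by each key position.

    attn: [batch, heads, queries, keys], rows summing to 1.
    Returns a [keys] vector; entries far above 1/keys mark sink candidates.
    """
    return attn.mean(dim=(0, 1, 2))

# Toy demo: bias the logits toward key 0, mimicking a BOS-like anchor token.
B, H, T = 2, 4, 16
logits = torch.randn(B, H, T, T)
logits[..., 0] += 4.0                          # every query scores key 0 highly
attn = torch.softmax(logits, dim=-1)
print(sink_scores(attn)[0].item(), 1.0 / T)    # key 0 far exceeds the uniform baseline
```

In real models this average is typically dominated by the first token or a special token, which is the sink signature the paper links to reference-frame formation.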
Related papers
- Cross-architecture universal feature coding via distribution alignment [88.73189953617594]
We introduce a new research problem: cross-architecture universal feature coding (CAUFC). We propose a two-step distribution alignment method. First, we design a format alignment step that converts CNN and Transformer features into a consistent 2D token format. Second, we propose a feature value alignment step that harmonizes statistical distributions via truncation and normalization. As a first attempt to study CAUFC, we evaluate our method on the image classification task. Experimental results demonstrate that our method achieves superior rate-accuracy trade-offs compared to the architecture-specific baseline.
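The two alignment steps are described only at a high level here. A minimal sketch of one plausible reading, assuming CNN features shaped [B, C, H, W] and a sigma-clipping rule for the truncation step (both our assumptions, not details from the paper):

```python
import torch

def to_tokens(cnn_feat: torch.Tensor) -> torch.Tensor:
    """Format alignment: [B, C, H, W] CNN feature map -> [B, H*W, C] tokens,
    matching the 2D token layout that Transformer features already use."""
    return cnn_feat.flatten(2).transpose(1, 2)

def align_values(tokens: torch.Tensor, clip: float = 3.0) -> torch.Tensor:
    """Feature value alignment: truncate outliers, then standardize per channel."""
    mu = tokens.mean(dim=1, keepdim=True)
    sigma = tokens.std(dim=1, keepdim=True)
    tokens = tokens.clamp(mu - clip * sigma, mu + clip * sigma)  # truncation
    return (tokens - tokens.mean(dim=1, keepdim=True)) / (tokens.std(dim=1, keepdim=True) + 1e-6)

feat = torch.randn(2, 256, 14, 14)            # e.g., a CNN stage output
print(align_values(to_tokens(feat)).shape)    # torch.Size([2, 196, 256])
```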
arXiv Detail & Related papers (2025-06-15T06:14:02Z) - On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Our framework offers a principled foundation for understanding positional interplay in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z) - Learning Correlation Structures for Vision Transformers [93.22434535223587]
We introduce a new attention mechanism, dubbed structural self-attention (StructSA).
We generate attention maps by recognizing space-time structures of key-query correlations via convolution.
This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations.
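As a rough illustration of the convolution-over-correlations idea, here is a deliberately simplified 2D-image-only sketch; the actual StructSA handles richer space-time structures, and the single-channel kernel below is our simplification.

```python
import torch
import torch.nn.functional as F

def structural_attention(q, k, v, conv_w, H, W):
    """Treat each query's key correlations as an H x W map, convolve it to
    pick up local spatial structure, and use the result as attention logits.

    q, k, v: [B, N, D] with N == H * W; conv_w: [1, 1, kh, kw] kernel.
    """
    B, N, D = q.shape
    corr = torch.einsum('bqd,bkd->bqk', q, k) / D ** 0.5  # [B, N, N] correlations
    maps = corr.reshape(B * N, 1, H, W)                   # one spatial map per query
    maps = F.conv2d(maps, conv_w, padding='same')         # structural filtering
    return torch.softmax(maps.reshape(B, N, N), dim=-1) @ v

B, H, W, D = 2, 8, 8, 32
q, k, v = (torch.randn(B, H * W, D) for _ in range(3))
print(structural_attention(q, k, v, torch.randn(1, 1, 3, 3), H, W).shape)  # [2, 64, 32]
```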
arXiv Detail & Related papers (2024-04-05T07:13:28Z) - GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as a relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z) - Spherical Position Encoding for Transformers [0.0]
We introduce the notion of "geotokens," which are input elements for transformer architectures.
Unlike in natural language, the sequential position is not important for the model; the geographical coordinates are.
We formulate a position encoding mechanism based on the RoPE architecture, adjusted for spherical coordinates.
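The summary gives no formula, but the stated recipe, RoPE-style rotations driven by spherical coordinates instead of sequence index, can be sketched as follows. The split of rotation pairs between latitude and longitude and the frequency schedule are our assumptions; the paper defines its own adjustment.

```python
import torch

def spherical_rope(x: torch.Tensor, lat: torch.Tensor, lon: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs by angles derived from geographic coordinates.

    x: [B, N, D] with D divisible by 4; lat, lon: [B, N] in radians.
    """
    B, N, D = x.shape
    half = D // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2) / half))  # RoPE-style schedule
    ang = torch.cat([lat[..., None] * freqs, lon[..., None] * freqs], dim=-1)  # [B, N, half]
    x1, x2 = x[..., 0::2], x[..., 1::2]           # paired feature dimensions
    cos, sin = ang.cos(), ang.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(1, 3, 8)
print(spherical_rope(x, torch.rand(1, 3), torch.rand(1, 3)).shape)  # torch.Size([1, 3, 8])
```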
arXiv Detail & Related papers (2023-10-04T09:28:59Z) - On the interplay of adversarial robustness and architecture components: patches, convolution and attention [65.20660287833537]
We study the effect of adversarial training on the interpretability of the learnt features and robustness to unseen threat models.
An ablation from ResNet to ConvNeXt reveals key architectural changes leading to almost $10\%$ higher $\ell_\infty$-robustness.
arXiv Detail & Related papers (2022-09-14T22:02:32Z) - Ripple Attention for Visual Perception with Sub-quadratic Complexity [7.425337104538644]
Transformer architectures are now central to modeling natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space.
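The distance weighting can be made concrete with a quadratic-time reference version; the paper's contribution is computing this sub-quadratically, which this sketch deliberately does not attempt, and the linear distance penalty is our illustrative choice.

```python
import torch

def ripple_weighted_attention(q, k, v, H, W, decay: float = 0.5):
    """Down-weight each token's contribution to a query by their spatial
    distance on the H x W grid. q, k, v: [B, N, D] with N == H * W.
    """
    B, N, D = q.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # [N, 2] grid coords
    dist = torch.cdist(pos, pos, p=1)                                # Manhattan distances
    logits = torch.einsum('bqd,bkd->bqk', q, k) / D ** 0.5
    return torch.softmax(logits - decay * dist, dim=-1) @ v          # distance penalty

q, k, v = (torch.randn(2, 64, 32) for _ in range(3))
print(ripple_weighted_attention(q, k, v, 8, 8).shape)  # torch.Size([2, 64, 32])
```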
arXiv Detail & Related papers (2021-10-06T02:00:38Z) - Cross-view Geo-localization with Evolving Transformer [7.5800316275498645]
Cross-view geo-localization is challenging due to drastic appearance and geometry differences across views.
We devise a novel Evolving geo-localization Transformer (EgoTR) that utilizes the properties of self-attention in Transformers to model global dependencies.
Our EgoTR performs favorably against state-of-the-art methods on standard, fine-grained and cross-dataset cross-view geo-localization tasks.
arXiv Detail & Related papers (2021-07-02T05:33:14Z) - Twins: Revisiting Spatial Attention Design in Vision Transformers [81.02454258677714]
In this work, we demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes.
We propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT.
Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks.
arXiv Detail & Related papers (2021-04-28T15:42:31Z)