Related papers: Geometric Analysis of Token Selection in Multi-Head Attention

URL: http://arxiv.org/abs/2602.01893v1
Date: Mon, 02 Feb 2026 10:04:40 GMT
Title: Geometric Analysis of Token Selection in Multi-Head Attention
Authors: Timur Mudarisov, Mikhal Burtsev, Tatiana Petrova, Radu State
Abstract summary: We present a framework for analysing multi-head attention in large language models (LLMs). We define geometric metrics - Precision, Recall, and F-score - to quantify separability between selected and non-selected tokens.
Score: 0.9099663022952497
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics - Precision, Recall, and F-score - to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, and sink similarity correlates with Recall. We also find that, in LLaMA-2-7B, heads specialize into three regimes (Retriever, Mixer, and Reset) with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head-level interpretability and informing geometry-aware sparsification and the design of attention in LLMs.
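Illustration: the abstract does not give the metric definitions, so the following is only a rough sketch of how a top-N selection view of one attention head could be scored geometrically. It assumes, purely for illustration, that a token counts as "geometrically selected" when its value vector lies within a ball of a chosen radius around the attention output; the paper's actual definitions may differ. The function names, the ball-shaped classifier, and the radius parameter are all hypothetical.

```python
import math


def top_n_indices(weights, n):
    """Indices of the n largest attention weights (ties broken by position)."""
    order = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    return set(order[:n])


def attention_output(weights, values):
    """Standard attention output: the weight-averaged value vector."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def geometric_prf(weights, values, n, radius):
    """Precision, Recall, and F-score of a ball classifier in value space.

    ASSUMPTION (not taken from the paper): a token is predicted as selected
    when its value vector lies within `radius` of the attention output.
    The ground-truth "selected" set is the top-n tokens by attention weight.
    """
    selected = top_n_indices(weights, n)
    center = attention_output(weights, values)
    inside = {i for i, v in enumerate(values) if math.dist(v, center) <= radius}
    hits = len(selected & inside)
    precision = hits / len(inside) if inside else 0.0
    recall = hits / len(selected) if selected else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Toy head: four tokens with 2-D value vectors (made-up numbers).
    w = [0.5, 0.3, 0.1, 0.1]
    vals = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
    for r in (0.5, 1.2):
        print(r, geometric_prf(w, vals, n=2, radius=r))
```

In this toy setup, widening the radius lets non-selected tokens fall inside the ball, which lowers Precision while Recall stays high; this is the kind of trade-off that an F-score summarises.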

Related papers

MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity [65.85858856481131]
The unstructured and irregular nature of point clouds poses a significant challenge for objective quality assessment (PCQA). We propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM).
arXiv Detail & Related papers (2026-01-03T14:58:52Z)
Universal Structure of Nonlocal Operators for Deterministic Navigation and Geometric Locking [3.178035874842575]
We transform the search for optimal nonlocal operators from a black box into a deterministic predict-verify operation. We show that transitions dominated by strong anisotropy exhibit geometric locking, where the optimal basis remains robust despite clear signatures of phase transitions in the spectral indicators.
arXiv Detail & Related papers (2025-12-16T11:15:47Z)
The Multiqubit Elegant Joint Measurement [0.0]
The Elegant Joint Measurement (EJM) is a highly symmetric, partially entangled two-qubit measurement. We extend the EJM to the multipartite setting by identifying all tetrahedrally symmetric, efficiently localizable multiqubit bases.
arXiv Detail & Related papers (2025-09-02T00:38:14Z)
Axis-level Symmetry Detection with Group-Equivariant Representation [48.813587457507786]
Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. We propose a novel framework for axis-level detection of the two most common symmetry types: reflection and rotation. Our method achieves state-of-the-art performance, outperforming existing approaches.
arXiv Detail & Related papers (2025-08-14T15:26:53Z)
Factorization of multimeters: a unified view on nonclassical quantum phenomena [1.4680035572775534]
Quantum theory exhibits various nonclassical features, such as measurement incompatibility, contextuality, steering, and Bell nonlocality. This work introduces a mathematical framework based on commuting diagrams that unifies them.
arXiv Detail & Related papers (2025-04-28T14:57:46Z)
Measuring Orthogonality in Representations of Generative Models [81.13466637365553]
In unsupervised representation learning, models aim to distill essential features from high-dimensional data into lower-dimensional learned representations. Disentanglement of independent generative processes has long been credited with producing high-quality representations. We propose two novel metrics: Importance-Weighted Orthogonality (IWO) and Importance-Weighted Rank (IWR).
arXiv Detail & Related papers (2024-07-04T08:21:54Z)
CWF: Consolidating Weak Features in High-quality Mesh Simplification [50.634070540791555]
We propose a smooth functional that simultaneously considers all of these requirements. The functional comprises a normal anisotropy term and a Centroidal Voronoi Tessellation (CVT) energy term.
arXiv Detail & Related papers (2024-04-24T05:37:17Z)
Markovian Sliced Wasserstein Distances: Beyond Independent Projections [51.80527230603978]
We introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We compare distances with previous SW variants in various applications such as flows, color transfer, and deep generative modeling to demonstrate the favorable performance of MSW.
arXiv Detail & Related papers (2023-01-10T01:58:15Z)
Partial Shape Similarity via Alignment of Multi-Metric Hamiltonian Spectra [10.74981839055037]
We propose a novel axiomatic method to match similar regions across shapes. Matching similar regions is formulated as the alignment of the spectra of operators closely related to the Laplace-Beltrami operator (LBO). We show that matching these dual spectra outperforms competing axiomatic frameworks when tested on standard benchmarks.
arXiv Detail & Related papers (2022-07-07T00:03:50Z)
Relative Pose from SIFT Features [50.81749304115036]
We derive a new linear constraint relating the unknown elements of the fundamental matrix and the orientation and scale. The proposed constraint is tested on a number of problems in a synthetic environment and on publicly available real-world datasets on more than 80000 image pairs.
arXiv Detail & Related papers (2022-03-15T14:16:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.