A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance
- URL: http://arxiv.org/abs/2512.07647v1
- Date: Mon, 08 Dec 2025 15:36:41 GMT
- Title: A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance
- Authors: Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen,
- Abstract summary: We develop a unified framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. We show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$.
- Score: 7.014801584517052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds -- from a single boundary gap through multi-gap and blockwise variants -- that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2=\tau\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$ with $\tau=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2\le\tau\,\mathrm{diam}_{H,T}$ and refinements linking the error to $\mathrm{Var}_P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal N(\mu,\sigma^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k_\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k_\varepsilon/n\approx\Phi_c(\sigma+\Phi^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k_\varepsilon/n$ and show that certified Top-$k$ can reduce scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
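The abstract's exact identities are straightforward to check numerically. The sketch below is an illustrative verification (not the authors' code): it builds a random attention distribution $P$, forms its renormalized Top-$k$ truncation $\hat P$, and confirms that $\mathrm{TV}(P,\hat P)$ equals both the discarded tail mass and $1-e^{-\mathrm{KL}(\hat P\Vert P)}$, and that the output error factorizes as $\tau\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 16, 4, 5

# Random logits -> softmax attention distribution P, random values V.
scores = rng.normal(size=n)
P = np.exp(scores - scores.max())
P /= P.sum()
V = rng.normal(size=(n, d))

head = np.argsort(P)[-k:]                # indices kept by Top-k
tail = np.setdiff1d(np.arange(n), head)  # discarded indices

P_hat = np.zeros(n)
P_hat[head] = P[head] / P[head].sum()    # renormalized Top-k truncation

tv = 0.5 * np.abs(P - P_hat).sum()
tail_mass = P[tail].sum()
kl = np.sum(P_hat[head] * np.log(P_hat[head] / P[head]))

# TV(P, P_hat) equals the discarded tail mass and 1 - exp(-KL(P_hat || P)).
assert np.isclose(tv, tail_mass)
assert np.isclose(tv, 1 - np.exp(-kl))

# Output-error factorization: ||Attn - Attn_k||_2 = tau * ||mu_tail - mu_head||_2.
attn = P @ V
attn_k = P_hat @ V
mu_head = P[head] @ V[head] / P[head].sum()
mu_tail = P[tail] @ V[tail] / P[tail].sum()
err = np.linalg.norm(attn - attn_k)
assert np.isclose(err, tv * np.linalg.norm(mu_tail - mu_head))
```

Both identities hold exactly (up to floating point) for any logits, key count $n$, and truncation level $k$, since they follow from the head-tail decomposition rather than from any distributional assumption.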
Related papers
- On the Capacity Region of Individual Key Rates in Vector Linear Secure Aggregation [55.126702858312456]
We show that it is not necessary for every user to hold a key, thereby strictly enlarging the best-known achievable region in the literature.
arXiv Detail & Related papers (2026-01-06T18:34:07Z) - $\mathsf{P} \neq \mathsf{NP}$: A Non-Relativizing Proof via Quantale Weakness and Geometric Complexity [0.0]
We work in the weakness quantale $w_Q=K_{\mathrm{poly}}(\cdot\mid\cdot)$. For an efficiently samplable ensemble $D_m$ made by masking random $3$-CNFs, we prove a Switching-by-Weakness normal form.
arXiv Detail & Related papers (2025-10-09T21:01:17Z) - Approximating the operator norm of local Hamiltonians via few quantum states [53.16156504455106]
Consider a Hermitian operator $A$ acting on a complex Hilbert space of dimension $2^n$. We focus on the case where $A$ has small degree in the Pauli expansion, or in other words, where $A$ is a local $n$-qubit Hamiltonian. We show that whenever $A$ is $d$-local, i.e., $\deg(A)\le d$, a discretization-type inequality holds.
arXiv Detail & Related papers (2025-09-15T14:26:11Z) - Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs [54.28273395444243]
We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound whose leading term sums $H^2\log K\,\land\,\mathtt{Var}_{\max}$ over state-action pairs with positive suboptimality gap $\Delta_h(s,a)$.
arXiv Detail & Related papers (2025-06-06T20:33:57Z) - Efficient Continual Finite-Sum Minimization [52.5238287567572]
We propose a twist on finite-sum minimization, dubbed continual finite-sum minimization.
Our approach significantly improves upon the $\mathcal{O}(n/\epsilon)$ first-order oracle (FO) calls that stochastic gradient descent requires.
We also prove that there is no natural first-order method with $\mathcal{O}(n/\epsilon^{\alpha})$ gradient complexity for $\alpha < 1/4$, establishing that the first-order complexity of our method is nearly tight.
arXiv Detail & Related papers (2024-06-07T08:26:31Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture, setting it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Estimation and Inference in Distributional Reinforcement Learning [28.253677740976197]
We show that a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2(1-\gamma)^4}\right)$ suffices to ensure that the Kolmogorov metric and the total-variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\epsilon$ with high probability.
Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
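For intuition about how that sample-size bound scales, the rate inside the $\widetilde O$ can be evaluated directly. The sketch below is illustrative only (the constants and logarithmic factors hidden by the $\widetilde O$ are ignored, and the numbers are not from the paper):

```python
def sample_size_bound(num_states: int, num_actions: int, eps: float, gamma: float) -> float:
    """Evaluate |S||A| / (eps^2 (1 - gamma)^4), the rate inside the tilde-O,
    ignoring constants and logarithmic factors."""
    return num_states * num_actions / (eps ** 2 * (1 - gamma) ** 4)

# Illustrative scales: accuracy eps and the effective horizon 1/(1-gamma)
# dominate the cost.
print(sample_size_bound(100, 10, 0.1, 0.9))   # ~1e9 up to constants
print(sample_size_bound(100, 10, 0.1, 0.99))  # grows by ~1e4 as gamma -> 1
```

The $(1-\gamma)^{-4}$ dependence means pushing the discount factor from $0.9$ to $0.99$ multiplies the required data by roughly $10^4$, which is why horizon dependence is the focus of much of this literature.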
arXiv Detail & Related papers (2023-09-29T14:14:53Z) - Perturbation Analysis of Randomized SVD and its Applications to Statistics [8.731676546744353]
RSVD is a class of computationally efficient algorithms for computing the truncated SVD of large data matrices. In this paper we derive upper bounds for the $\ell_2$ and $\ell_{2,\infty}$ distances between the approximate left singular vectors $\widehat{\mathbf{U}}$ computed by RSVD and the exact left singular vectors of $\widehat{\mathbf{M}}$. We apply our theoretical results to settings where $\widehat{\mathbf{M}}$ is an additive perturbation of some unobserved signal matrix $\mathbf{M}$.
arXiv Detail & Related papers (2022-03-19T07:26:45Z) - Low-degree learning and the metric entropy of polynomials [44.99833362998488]
We prove that any (deterministic or randomized) algorithm which learns $\mathscr{F}_{n,d}$ with $L_2$-accuracy $\varepsilon$ requires at least $\Omega\big((1-\sqrt{\varepsilon})\,2^d\log n\big)$ queries, and that the metric entropy $\log\mathsf{M}(\mathscr{F}_{n,d},\|\cdot\|_{L_2},\varepsilon)$ satisfies a two-sided estimate with lower bound $c(1-\varepsilon)\,2^d\log n$.
arXiv Detail & Related papers (2022-03-17T23:52:08Z) - Efficient inference of interventional distributions [13.31079561447385]
We consider the problem of efficiently inferring interventional distributions in a causal Bayesian network from a finite number of observations.
We show that when $\mathbf{Y}$ is an arbitrary set, there is no efficient algorithm that outputs an evaluator of a distribution that is $\varepsilon$-close to $P_{\mathbf{x}}(\mathbf{Y})$, unless all problems that have statistical zero-knowledge proofs, including the Graph Isomorphism problem, have efficient randomized algorithms.
arXiv Detail & Related papers (2021-07-25T02:40:01Z) - The planted matching problem: Sharp threshold and infinite-order phase transition [25.41713098167692]
We study the problem of reconstructing a perfect matching $M^*$ hidden in a randomly weighted $n\times n$ bipartite graph.
We show that if $\sqrt{d}\,B(\mathcal{P},\mathcal{Q})\ge 1+\epsilon$ for an arbitrarily small constant $\epsilon>0$, then the reconstruction error of any estimator is bounded away from $0$.
arXiv Detail & Related papers (2021-03-17T00:59:33Z) - From Smooth Wasserstein Distance to Dual Sobolev Norm: Empirical Approximation and Statistical Applications [18.618590805279187]
We show that $\mathsf{W}_p^{(\sigma)}$ is controlled by the $p$th-order smooth dual Sobolev norm $\mathsf{d}_p^{(\sigma)}$.
We derive the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(\sigma)}(\hat\mu_n,\mu)$ in all dimensions $d$ when $\mu$ is sub-Gaussian.
arXiv Detail & Related papers (2021-01-11T17:23:24Z) - Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals [49.60752558064027]
We study the fundamental problems of agnostically learning halfspaces and ReLUs under Gaussian marginals.
Our lower bounds provide strong evidence that current upper bounds for these tasks are essentially best possible.
arXiv Detail & Related papers (2020-06-29T17:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.