Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
- URL: http://arxiv.org/abs/2602.18851v1
- Date: Sat, 21 Feb 2026 14:29:22 GMT
- Title: Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
- Authors: Seyed Morteza Emadi
- Abstract summary: Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.
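As a concreteness aid, below is a minimal NumPy sketch of the scale-computation step described above, assuming `W_q` and `W_k` are a head's query/key projection matrices of shape `(d, d_h)` and `fp8_max` is the largest representable FP8 E4M3 magnitude (448). The function names, the activation-norm bound `x_norm_bound`, and the worst-case Cauchy-Schwarz bound used here are illustrative assumptions; the paper's rank-aware concentration inequality would replace that bound with a tighter probabilistic one.

```python
import numpy as np

def spectral_norm_implicit(W_q, W_k, n_iter=30, seed=0):
    """Estimate sigma_max(M) for M = W_q @ W_k.T by power iteration,
    applying M and M.T implicitly so the d x d product is never formed."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W_q.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W_q @ (W_k.T @ v)                 # u = M v
        v = W_k @ (W_q.T @ u)                 # v = M^T u
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(W_q @ (W_k.T @ v)))  # sigma_max estimate

def geometry_aware_scale(W_q, W_k, d_h, x_norm_bound, fp8_max=448.0):
    """Hypothetical per-layer FP8 scale from a worst-case logit bound:
    |S_ij| <= ||x_i|| ||x_j|| ||M||_2 / sqrt(d_h), so dividing logits by
    this bound keeps them inside the FP8 representable range."""
    sigma = spectral_norm_implicit(W_q, W_k)
    logit_bound = x_norm_bound**2 * sigma / np.sqrt(d_h)
    return fp8_max / logit_bound
```

Because the power iteration only multiplies by the two `(d, d_h)` factors, the `d x d` matrix $W^Q W^{K\top}$ is never materialized, which is presumably what the abstract means by implicit power iteration.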
Related papers
- Sublinear Time Quantum Algorithm for Attention Approximation [13.665266438908533]
We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q, K, V$. Our algorithm preprocesses these matrices in $\widetilde{O}\left( \varepsilon^{-1} n^{0.5} \left( s_\varepsilon^{2.5} + s_\varepsilon^{1.5} d + \varepsilon^{0.5} d \right) \right)$ time, where $\varepsilon$ is the target accuracy and $s_\varepsilon$ is the $\varepsilon$-statistical dimension of the exponential kernel.
arXiv Detail & Related papers (2026-01-31T19:33:52Z) - Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability [0.0]
We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics. We also introduce a precision- and width-aware LayerNorm indicator $\rho_{\mathrm{LN}}$ with a matching first-order bound.
arXiv Detail & Related papers (2025-10-17T01:03:02Z) - Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias [0.0]
We give a unified, high-probability characterization for the scaling of the family of parameter norms. We then study diagonal linear networks trained by gradient descent.
arXiv Detail & Related papers (2025-09-25T13:59:22Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [54.28350823319057]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ for $\frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\|\nabla f(x^k)\|_1$. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Our convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\|\nabla f(x^k)\|_2^2 \le O(\frac{1}{\sqrt{T}})$ rate of SGD measured by the $\ell_2$ norm.
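For reference, a minimal sketch of the RMSProp update with its heavy-ball momentum extension that such rates apply to; the hyperparameter names and default values are the conventional ones, not taken from the paper:

```python
import numpy as np

def rmsprop_momentum_step(x, m, v, grad, lr=1e-3, beta=0.9, rho=0.99, eps=1e-8):
    """One RMSProp step with heavy-ball momentum (standard formulation):
    v tracks an EMA of squared gradients; m accumulates the scaled step."""
    v = rho * v + (1.0 - rho) * grad**2
    m = beta * m + lr * grad / (np.sqrt(v) + eps)
    return x - m, m, v
```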
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We analyze a training algorithm for distributed computation workers with varying communication frequencies. In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^{2}\epsilon^{-2} + \sqrt{\tau_{\max}\tau_{\text{avg}}}\,\epsilon^{-1}\right)$. We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
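A toy sketch of the delayed-gradient model underlying such analyses: each step applies a gradient evaluated at a stale iterate, and the average staleness plays the role of the average delay mentioned above. All names here (`async_sgd`, `delays`) are illustrative:

```python
import numpy as np

def async_sgd(grad, x0, delays, lr=0.1):
    """Asynchronous SGD on a shared iterate: the update at step k uses a
    gradient evaluated at the iterate from delays[k] steps earlier."""
    xs = [np.asarray(x0, dtype=float)]
    for k, tau in enumerate(delays):
        stale = xs[max(0, k - tau)]        # stale read of the parameters
        xs.append(xs[-1] - lr * grad(stale))
    return xs[-1]

# Example: quadratic objective, with per-step delays averaging 2
x_final = async_sgd(lambda x: 2 * x, x0=np.ones(3),
                    delays=[0, 1, 2, 3, 2, 2, 1, 3])
```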
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
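A minimal sketch of a variance-weighted least-squares (ridge) estimator of the kind the summary refers to, weighting each sample by $1/\sigma_k^2$ when the variances are known; the names and the ridge parameter are illustrative assumptions:

```python
import numpy as np

def weighted_ridge(phi, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression: each sample (phi[k], y[k]) is
    weighted by 1/sigma2[k], so low-variance transitions dominate the fit."""
    w = 1.0 / np.asarray(sigma2)
    A = (phi * w[:, None]).T @ phi + lam * np.eye(phi.shape[1])
    b = (phi * w[:, None]).T @ y
    return np.linalg.solve(A, b)
```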
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - On the Self-Penalization Phenomenon in Feature Selection [69.16452769334367]
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels.
As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
arXiv Detail & Related papers (2021-10-12T09:36:41Z) - Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
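An illustrative sketch of a ternary random features map: with projection weights in $\{-1, 0, +1\}$, the matrix product reduces to signed additions and each weight needs only 2 bits. The sparsity level and the sign nonlinearity are assumptions here, not the paper's exact construction:

```python
import numpy as np

def ternary_random_features(X, n_features, sparsity=0.5, seed=0):
    """Project X with a random ternary matrix W in {-1, 0, +1}; W @ x needs
    only additions/subtractions, and each entry of W fits in 2 bits."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.choice([-1, 0, 1], size=(d, n_features),
                   p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2])
    return np.sign(X @ W)   # sign nonlinearity (illustrative choice)
```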
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Entanglement scaling for $\lambda\phi_2^4$ [0.0]
We show that the order parameter $\phi$, the correlation length $\xi$ and quantities like $\phi^3$ and the entanglement entropy exhibit useful double scaling properties.
We find the value $\alpha_c = 11.09698(31)$ for the critical point, improving on previous results.
arXiv Detail & Related papers (2021-04-21T14:43:12Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - Robust Interference Management for SISO Systems with Multiple Over-the-Air Computations [16.52374405363812]
We consider the over-the-air computation of sums over a shared complex-valued multiple access channel (MAC).
Finding appropriate Tx-Rx scaling factors requires balancing a low error in the computation of $s_n$ against the interference it induces.
arXiv Detail & Related papers (2020-04-21T11:15:26Z)