Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
- URL: http://arxiv.org/abs/2602.18851v1
- Date: Sat, 21 Feb 2026 14:29:22 GMT
- Title: Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
- Authors: Seyed Morteza Emadi
- Abstract summary: Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.
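As a concreteness aid, below is a minimal NumPy sketch of the scale-computation step described above, assuming `W_q` and `W_k` are a head's query/key projection matrices of shape `(d, d_h)` and `fp8_max` is the largest representable FP8 E4M3 magnitude (448). The function names, the activation-norm bound `x_norm_bound`, and the worst-case Cauchy-Schwarz bound used here are illustrative assumptions; the paper's rank-aware concentration inequality would replace that bound with a tighter probabilistic one.

```python
import numpy as np

def spectral_norm_implicit(W_q, W_k, n_iter=30, seed=0):
    """Estimate sigma_max(M) for M = W_q @ W_k.T by power iteration,
    applying M and M.T implicitly so the d x d product is never formed."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W_q.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W_q @ (W_k.T @ v)                 # u = M v
        v = W_k @ (W_q.T @ u)                 # v = M^T u
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(W_q @ (W_k.T @ v)))  # sigma_max estimate

def geometry_aware_scale(W_q, W_k, d_h, x_norm_bound, fp8_max=448.0):
    """Hypothetical per-layer FP8 scale from a worst-case logit bound:
    |S_ij| <= ||x_i|| ||x_j|| ||M||_2 / sqrt(d_h), so dividing logits by
    this bound keeps them inside the FP8 representable range."""
    sigma = spectral_norm_implicit(W_q, W_k)
    logit_bound = x_norm_bound**2 * sigma / np.sqrt(d_h)
    return fp8_max / logit_bound
```

Because the power iteration only multiplies by the two `(d, d_h)` factors, the `d x d` matrix $W^Q W^{K\top}$ is never materialized, which is presumably what the abstract means by implicit power iteration.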
Related papers
- Sublinear Time Quantum Algorithm for Attention Approximation [13.665266438908533]
We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q, K, V$. Our algorithm preprocesses these matrices in $\widetilde{O}\left( \varepsilon^{-1} n^{0.5} \left( s_\varepsilon^{2.5} + s_\varepsilon^{1.5} d + \varepsilon^{0.5} d \right) \right)$ time, where $\varepsilon$ is the target accuracy and $s_\varepsilon$ is the $\varepsilon$-statistical dimension of the exponential kernel.
arXiv Detail & Related papers (2026-01-31T19:33:52Z) - Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability [0.0]
We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics. We also introduce a precision- and width-aware LayerNorm indicator $\rho_{\mathrm{LN}}$ with a matching first-order bound.
arXiv Detail & Related papers (2025-10-17T01:03:02Z) - Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias [0.0]
We give a unified, high-probability characterization for the scaling of the family of parameter norms. We then study diagonal linear networks trained by gradient descent.
arXiv Detail & Related papers (2025-09-25T13:59:22Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [54.28350823319057]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ for $\frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\|\nabla f(x^k)\|_1$. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Our convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\|\nabla f(x^k)\|_2^2 \le O(\frac{1}{\sqrt{T}})$ rate of SGD measured by the $\ell_2$ norm.
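For reference, a minimal sketch of the RMSProp update with its heavy-ball momentum extension that such rates apply to; the hyperparameter names and default values are the conventional ones, not taken from the paper:

```python
import numpy as np

def rmsprop_momentum_step(x, m, v, grad, lr=1e-3, beta=0.9, rho=0.99, eps=1e-8):
    """One RMSProp step with heavy-ball momentum (standard formulation):
    v tracks an EMA of squared gradients; m accumulates the scaled step."""
    v = rho * v + (1.0 - rho) * grad**2
    m = beta * m + lr * grad / (np.sqrt(v) + eps)
    return x - m, m, v
```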
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We analyze a training algorithm for distributed computation workers with varying communication frequencies. In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^{2}\epsilon^{-2} + \sqrt{\tau_{\max}\tau_{\text{avg}}}\,\epsilon^{-1}\right)$. We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
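A toy sketch of the delayed-gradient model underlying such analyses: each step applies a gradient evaluated at a stale iterate, and the average staleness plays the role of the average delay mentioned above. All names here (`async_sgd`, `delays`) are illustrative:

```python
import numpy as np

def async_sgd(grad, x0, delays, lr=0.1):
    """Asynchronous SGD on a shared iterate: the update at step k uses a
    gradient evaluated at the iterate from delays[k] steps earlier."""
    xs = [np.asarray(x0, dtype=float)]
    for k, tau in enumerate(delays):
        stale = xs[max(0, k - tau)]        # stale read of the parameters
        xs.append(xs[-1] - lr * grad(stale))
    return xs[-1]

# Example: quadratic objective, with per-step delays averaging 2
x_final = async_sgd(lambda x: 2 * x, x0=np.ones(3),
                    delays=[0, 1, 2, 3, 2, 2, 1, 3])
```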
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
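A minimal sketch of a variance-weighted least-squares (ridge) estimator of the kind the summary refers to, weighting each sample by $1/\sigma_k^2$ when the variances are known; the names and the ridge parameter are illustrative assumptions:

```python
import numpy as np

def weighted_ridge(phi, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression: each sample (phi[k], y[k]) is
    weighted by 1/sigma2[k], so low-variance transitions dominate the fit."""
    w = 1.0 / np.asarray(sigma2)
    A = (phi * w[:, None]).T @ phi + lam * np.eye(phi.shape[1])
    b = (phi * w[:, None]).T @ y
    return np.linalg.solve(A, b)
```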
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - On the Self-Penalization Phenomenon in Feature Selection [69.16452769334367]
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels.
As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
arXiv Detail & Related papers (2021-10-12T09:36:41Z) - Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
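An illustrative sketch of a ternary random features map: with projection weights in $\{-1, 0, +1\}$, the matrix product reduces to signed additions and each weight needs only 2 bits. The sparsity level and the sign nonlinearity are assumptions here, not the paper's exact construction:

```python
import numpy as np

def ternary_random_features(X, n_features, sparsity=0.5, seed=0):
    """Project X with a random ternary matrix W in {-1, 0, +1}; W @ x needs
    only additions/subtractions, and each entry of W fits in 2 bits."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.choice([-1, 0, 1], size=(d, n_features),
                   p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2])
    return np.sign(X @ W)   # sign nonlinearity (illustrative choice)
```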
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Entanglement scaling for $\lambda\phi_2^4$ [0.0]
We show that the order parameter $\phi$, the correlation length $\xi$ and quantities like $\phi^3$ and the entanglement entropy exhibit useful double scaling properties.
We find the value $\alpha_c = 11.09698(31)$ for the critical point, improving on previous results.
arXiv Detail & Related papers (2021-04-21T14:43:12Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - Robust Interference Management for SISO Systems with Multiple Over-the-Air Computations [16.52374405363812]
We consider the over-the-air computation of sums over a shared complex-valued multiple access channel (MAC).
Finding appropriate Tx-Rx scaling factors requires balancing a low error in the computation of $s_n$ against the interference it induces.
arXiv Detail & Related papers (2020-04-21T11:15:26Z)