Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability
- URL: http://arxiv.org/abs/2510.21770v1
- Date: Fri, 17 Oct 2025 01:03:02 GMT
- Title: Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability
- Authors: Jinwoo Baek
- Abstract summary: We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics. We also introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers trained in low precision can suffer forward-error amplification. We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$. We prove a residual relaxation inequality showing that residual blocks attenuate depth-wise accumulation, and we introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound in the $\epsilon$-dominated regime. These pieces yield a unified forward-stability bound whose right-hand side is directly estimable during training. On Tiny-ViT/CIFAR-10 we evaluate the bound and its components. (1) The combined predictor $\kappa_{\rm softmax}\,(1+\kappa_{\rm score})\,\kappa(V)\,\|W_O\|_2+\kappa_{\rm eff}+C_{\rm LN}$ tracks FP32$\leftrightarrow$LP mismatches across seeds, widths, and precisions; scaling by $\epsilon_{\rm mach}$ collapses mixed-precision points. (2) The time-series maximum of $\kappa_{\rm softmax}$ acts as an early-warning signal, leading error spikes by 16-24 steps (corr. 0.65-0.82; permutation $p \approx 10^{-3}$; Precision@K 0.89-1.00). (3) Guided by $\rho_{\rm LN}$, a small LayerNorm-$\epsilon$ tweak targeting $\rho_\star$ gives consistent stabilization (mean tail-loss reduction $\approx 0.010$ at $\rho_\star = 0.6$, cap $= 10^{-2}$) with negligible overhead. Overall, our theory supplies actionable, unitless diagnostics that (i) explain when self-attention is fragile, (ii) forecast instability, and (iii) motivate a minimally invasive mitigation.
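Because the right-hand side of the bound is built from quantities available during a forward pass, it can in principle be logged alongside training. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation: the helper names (`attention_diagnostics`, `layernorm_eps_for_target`) and the proxy formulas (largest logit magnitude for $\kappa_{\rm score}$, worst-row softmax Jacobian norm for $\kappa_{\rm softmax}$, 2-norm condition number for $\kappa(V)$, a simple variance-based stand-in for $\rho_{\rm LN}$) are assumptions for illustration; $\kappa_{\rm eff}$ and $C_{\rm LN}$ are omitted entirely.

```python
# Minimal, illustrative sketch (PyTorch), not the authors' code: rough
# stand-ins for the per-layer diagnostics named in the abstract, computed
# for a single attention head. Helper names and proxy formulas below are
# assumptions; the paper defines kappa_score, kappa_softmax, kappa(V),
# rho_LN, kappa_eff and C_LN precisely, and those definitions will differ.
import torch


def attention_diagnostics(Q, K, V, W_O):
    """Q, K, V: (n_tokens, d_h) for one head; W_O: (d_h, d_model)."""
    d_h = Q.shape[-1]
    S = Q @ K.T / d_h ** 0.5                      # pre-softmax scores
    P = torch.softmax(S, dim=-1)                  # row-stochastic attention

    # Score-scale ratio (proxy): magnitude of the largest logit.
    kappa_score = S.abs().max().item()

    # Rowwise softmax sensitivity (proxy): worst-row spectral norm of the
    # softmax Jacobian diag(p) - p p^T.
    J = torch.diag_embed(P) - P.unsqueeze(-1) * P.unsqueeze(-2)
    kappa_softmax = torch.linalg.matrix_norm(J, ord=2).max().item()

    # Value conditioning: 2-norm condition number of V.
    kappa_V = torch.linalg.cond(V).item()

    # Attention part of the combined predictor from the abstract,
    #   kappa_softmax * (1 + kappa_score) * kappa(V) * ||W_O||_2,
    # omitting the residual/LayerNorm terms kappa_eff and C_LN, which need
    # quantities not shown in this sketch.
    w_o_norm = torch.linalg.matrix_norm(W_O, ord=2).item()
    predictor = kappa_softmax * (1.0 + kappa_score) * kappa_V * w_o_norm
    return {"kappa_score": kappa_score, "kappa_softmax": kappa_softmax,
            "kappa_V": kappa_V, "predictor": predictor}


def layernorm_eps_for_target(x, rho_star=0.6, cap=1e-2):
    """Pick a LayerNorm epsilon aiming at a target indicator value rho_star.

    Assumed proxy only: rho = eps / (mean feature variance + eps), solved
    for eps. The paper's rho_LN is precision- and width-aware and is not
    reproduced here; cap=1e-2 mirrors the cap quoted in the abstract.
    """
    var = x.float().var(dim=-1, unbiased=False).mean().item()
    eps = rho_star * var / max(1.0 - rho_star, 1e-12)
    return min(eps, cap)


if __name__ == "__main__":
    torch.manual_seed(0)
    n, d_h, d_model = 64, 32, 128
    Q, K, V = (torch.randn(n, d_h) for _ in range(3))
    W_O = torch.randn(d_h, d_model) / d_h ** 0.5
    print(attention_diagnostics(Q, K, V, W_O))
    print(layernorm_eps_for_target(torch.randn(n, d_model)))
```

To mirror the early-warning experiment, one would log the per-step maximum of the $\kappa_{\rm softmax}$ proxy across layers and flag sustained increases, which the paper reports lead error spikes by 16-24 steps.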
Related papers
- Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training [0.0]
Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a rank-aware concentration inequality: when the interaction matrix $M = W_Q W_K^\top$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d\,\cdots)$
arXiv Detail & Related papers (2026-02-21T14:29:22Z) - From Average Sensitivity to Small-Loss Regret Bounds under Random-Order Model [26.860985092865203]
We study online learning in the random-order model, where the multiset of loss functions is chosen adversarially but revealed in a uniformly random order. Our approach sheds light on the power of sparsification and related techniques in establishing small-loss regret bounds in the random-order model.
arXiv Detail & Related papers (2026-02-10T06:46:01Z) - A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance [7.014801584517052]
We develop a unified framework for certified Top-$k$ attention truncation that quantifies error at both the distribution and output levels. We show that the total-variation distance $\mathrm{TV}(P,\hat P)$ coincides with the discarded softmax tail mass.
arXiv Detail & Related papers (2025-12-08T15:36:41Z) - Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games [60.871651115241406]
A considerable chasm has loomed for decades between theory and practice in zero-sum game solving through first-order methods. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ optimal convergence guarantees, while also being on par with PRM$^+$ on benchmark games.
arXiv Detail & Related papers (2025-10-06T00:33:20Z) - Robust learning of halfspaces under log-concave marginals [6.852292115526837]
We give an algorithm that learns linear threshold functions and returns a classifier with boundary volume $O(r+\varepsilon)$ at perturbation radius $r$. The time and sample complexity of $d^{\tilde{O}(1/\varepsilon^2)}$ matches the complexity of Boolean regression.
arXiv Detail & Related papers (2025-05-19T20:12:16Z) - Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness [74.18546828528298]
We show that SignSGD with Majority Voting can robustly work over the whole range of the heavy-tail parameter $\kappa$, with complexity bounds whose exponents involve $\kappa/(\kappa-1)$.
arXiv Detail & Related papers (2025-02-11T19:54:11Z) - Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Exact results on finite size corrections for surface codes tailored to biased noise [0.0]
We study the XY and XZZX surface codes under phase-biased noise.
We find exact solutions at a special disordered point.
We show that calculating thresholds based not only on the total logical failure rate, but also independently on the phase- and bit-flip logical failure rates, gives a more confident estimate.
arXiv Detail & Related papers (2024-01-08T16:38:56Z) - Nearly Minimax Optimal Submodular Maximization with Bandit Feedback [12.28389976959093]
We minimize the learner's regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| = k$. In this work, we establish the first minimax lower bound for this setting, which scales like $\tilde{\Omega}\bigl(\min_{L \le k}\bigl(T^{2/3} + \sqrt{\binom{n}{k-L}\,T}\bigr)\bigr)$. For a slightly restricted algorithm class, we prove a stronger regret lower bound of $\tilde{\Omega}(\min_{L}\cdots)$
arXiv Detail & Related papers (2023-10-27T20:19:03Z) - Optimal Query Complexities for Dynamic Trace Estimation [59.032228008383484]
We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly.
We provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $\epsilon$ error with $\delta$ failure probability.
Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error even in the static setting, and (2) are the first unconditional lower bounds for dynamic trace estimation.
arXiv Detail & Related papers (2022-09-30T04:15:44Z) - A spectral least-squares-type method for heavy-tailed corrupted
regression with unknown covariance \& heterogeneous noise [2.019622939313173]
We revisit heavy-tailed corrupted least-squares linear regression, assuming a corrupted $n$-sized label-feature sample with at most $\epsilon n$ arbitrary outliers.
We propose a near-optimal computationally tractable estimator, based on the power method, assuming no knowledge of $(\Sigma,\Xi)$ nor the operator norm of $\Xi$.
arXiv Detail & Related papers (2022-09-06T23:37:31Z) - Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor
Decompositions [51.19236668224547]
We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions.
For tensor train decomposition, we give a bicriteria $(1 + \epsilon)$-approximation algorithm with a small bicriteria rank and $O(q \cdot \mathrm{nnz}(A))$ running time.
In addition, we extend our algorithm to tensor networks with arbitrary graphs.
arXiv Detail & Related papers (2022-07-15T11:55:09Z) - More Optimal Simulation of Universal Quantum Computers [0.0]
Worst-case sampling cost has plateaued at $\le (2+\sqrt{2})\,\xi_t\,\delta^{-1}$ in the limit that $t \rightarrow \infty$.
We reduce this prefactor 68-fold by a leading-order reduction in $t$ through correlated sampling.
arXiv Detail & Related papers (2022-02-02T19:00:03Z) - Self-training Converts Weak Learners to Strong Learners in Mixture
Models [86.7137362125503]
We show that a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$.
We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples.
arXiv Detail & Related papers (2021-06-25T17:59:16Z) - Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to
Minimax Optimization [133.53164856723782]
We propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box mini-optimization where only function values can be obtained.
Meanwhile, we propose an accelerated zeroth-order momentum descent (Acc-MDA) method for black-box minimax optimization, where only function values can be obtained.
In particular, our Acc-MDA can obtain a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ with a batch size $O(\kappa_y^4)$.
arXiv Detail & Related papers (2020-08-18T22:19:29Z) - Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed
Payoffs [35.988644745703645]
We analyze the linear bandits with heavy-tailed payoffs, where the payoffs admit finite $(1+\epsilon)$ moments.
We propose two novel algorithms which enjoy a sublinear regret bound of $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{1+\epsilon}})$.
arXiv Detail & Related papers (2020-04-28T13:01:38Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T},\ \sqrt{\mathcal{E}}\,T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.