Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability
- URL: http://arxiv.org/abs/2510.21770v1
- Date: Fri, 17 Oct 2025 01:03:02 GMT
- Title: Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability
- Authors: Jinwoo Baek
- Abstract summary: We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics. We also introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers trained in low precision can suffer forward-error amplification. We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics: a score-scale ratio $\kappa_{\rm score}$, a rowwise softmax sensitivity $\kappa_{\rm softmax}$, and value conditioning $\kappa(V)$. We prove a residual relaxation inequality showing that residual blocks attenuate depth-wise accumulation, and we introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound in the $\epsilon$-dominated regime. These pieces yield a unified forward-stability bound whose right-hand side is directly estimable during training. On Tiny-ViT/CIFAR-10 we evaluate the bound and its components. (1) The combined predictor $\kappa_{\rm softmax}\,(1+\kappa_{\rm score})\,\kappa(V)\,\|W_O\|_2+\kappa_{\rm eff}+C_{\rm LN}$ tracks FP32$\leftrightarrow$LP mismatches across seeds, widths, and precisions; scaling by $\epsilon_{\rm mach}$ collapses mixed-precision points. (2) The time-series maximum of $\kappa_{\rm softmax}$ acts as an early-warning signal, leading error spikes by 16-24 steps (corr. 0.65-0.82; permutation $p \approx 10^{-3}$; Precision@K 0.89-1.00). (3) Guided by $\rho_{\rm LN}$, a small LayerNorm-$\epsilon$ tweak targeting $\rho_\star$ gives consistent stabilization (mean tail-loss reduction $\approx 0.010$ at $\rho_\star = 0.6$, cap $= 10^{-2}$) with negligible overhead. Overall, our theory supplies actionable, unitless diagnostics that (i) explain when self-attention is fragile, (ii) forecast instability, and (iii) motivate a minimally invasive mitigation.
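Because the right-hand side of the bound is built from quantities available during a forward pass, it can in principle be logged alongside training. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation: the helper names (`attention_diagnostics`, `layernorm_eps_for_target`) and the proxy formulas (largest logit magnitude for $\kappa_{\rm score}$, worst-row softmax Jacobian norm for $\kappa_{\rm softmax}$, 2-norm condition number for $\kappa(V)$, a simple variance-based stand-in for $\rho_{\rm LN}$) are assumptions for illustration; $\kappa_{\rm eff}$ and $C_{\rm LN}$ are omitted entirely.

```python
# Minimal, illustrative sketch (PyTorch), not the authors' code: rough
# stand-ins for the per-layer diagnostics named in the abstract, computed
# for a single attention head. Helper names and proxy formulas below are
# assumptions; the paper defines kappa_score, kappa_softmax, kappa(V),
# rho_LN, kappa_eff and C_LN precisely, and those definitions will differ.
import torch


def attention_diagnostics(Q, K, V, W_O):
    """Q, K, V: (n_tokens, d_h) for one head; W_O: (d_h, d_model)."""
    d_h = Q.shape[-1]
    S = Q @ K.T / d_h ** 0.5                      # pre-softmax scores
    P = torch.softmax(S, dim=-1)                  # row-stochastic attention

    # Score-scale ratio (proxy): magnitude of the largest logit.
    kappa_score = S.abs().max().item()

    # Rowwise softmax sensitivity (proxy): worst-row spectral norm of the
    # softmax Jacobian diag(p) - p p^T.
    J = torch.diag_embed(P) - P.unsqueeze(-1) * P.unsqueeze(-2)
    kappa_softmax = torch.linalg.matrix_norm(J, ord=2).max().item()

    # Value conditioning: 2-norm condition number of V.
    kappa_V = torch.linalg.cond(V).item()

    # Attention part of the combined predictor from the abstract,
    #   kappa_softmax * (1 + kappa_score) * kappa(V) * ||W_O||_2,
    # omitting the residual/LayerNorm terms kappa_eff and C_LN, which need
    # quantities not shown in this sketch.
    w_o_norm = torch.linalg.matrix_norm(W_O, ord=2).item()
    predictor = kappa_softmax * (1.0 + kappa_score) * kappa_V * w_o_norm
    return {"kappa_score": kappa_score, "kappa_softmax": kappa_softmax,
            "kappa_V": kappa_V, "predictor": predictor}


def layernorm_eps_for_target(x, rho_star=0.6, cap=1e-2):
    """Pick a LayerNorm epsilon aiming at a target indicator value rho_star.

    Assumed proxy only: rho = eps / (mean feature variance + eps), solved
    for eps. The paper's rho_LN is precision- and width-aware and is not
    reproduced here; cap=1e-2 mirrors the cap quoted in the abstract.
    """
    var = x.float().var(dim=-1, unbiased=False).mean().item()
    eps = rho_star * var / max(1.0 - rho_star, 1e-12)
    return min(eps, cap)


if __name__ == "__main__":
    torch.manual_seed(0)
    n, d_h, d_model = 64, 32, 128
    Q, K, V = (torch.randn(n, d_h) for _ in range(3))
    W_O = torch.randn(d_h, d_model) / d_h ** 0.5
    print(attention_diagnostics(Q, K, V, W_O))
    print(layernorm_eps_for_target(torch.randn(n, d_model)))
```

To mirror the early-warning experiment, one would log the per-step maximum of the $\kappa_{\rm softmax}$ proxy across layers and flag sustained increases, which the paper reports lead error spikes by 16-24 steps.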
Related papers
- Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training [0.0]
Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a rank-aware concentration inequality: when the interaction matrix $M = W_Q W_K^\top$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d\,\cdots)$
arXiv Detail & Related papers (2026-02-21T14:29:22Z) - From Average Sensitivity to Small-Loss Regret Bounds under Random-Order Model [26.860985092865203]
We study online learning in the random-order model, where the multiset of loss functions is chosen adversarially but revealed in a uniformly random order. Our approach sheds light on the power of sparsification and related techniques in establishing small-loss regret bounds in the random-order model.
arXiv Detail & Related papers (2026-02-10T06:46:01Z) - A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance [7.014801584517052]
We develop a unified framework for certified Top-$k$ attention truncation that quantifies error at both the distribution and output levels. We show that the total-variation distance $\mathrm{TV}(P,\hat P)$ coincides with the discarded softmax tail mass.
arXiv Detail & Related papers (2025-12-08T15:36:41Z) - Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games [60.871651115241406]
A considerable chasm has loomed for decades between theory and practice in zero-sum game solving through first-order methods. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ optimal convergence guarantees, while also being on par with PRM$^+$ on benchmark games.
arXiv Detail & Related papers (2025-10-06T00:33:20Z) - Robust learning of halfspaces under log-concave marginals [6.852292115526837]
We give an algorithm that learns linear threshold functions and returns a classifier with boundary volume $O(r+\varepsilon)$ at perturbation radius $r$. The time and sample complexity of $d^{\tilde{O}(1/\varepsilon^2)}$ matches the complexity of Boolean regression.
arXiv Detail & Related papers (2025-05-19T20:12:16Z) - Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness [74.18546828528298]
We show that SignSGD with Majority Voting can robustly work over the whole range of the heavy-tail parameter $\kappa$, with complexity bounds whose exponents involve $\kappa/(\kappa-1)$.
arXiv Detail & Related papers (2025-02-11T19:54:11Z) - Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Exact results on finite size corrections for surface codes tailored to biased noise [0.0]
We study the XY and XZZX surface codes under phase-biased noise.
We find exact solutions at a special disordered point.
We show that calculating thresholds based not only on the total logical failure rate, but also independently on the phase- and bit-flip logical failure rates, gives a more confident estimate.
arXiv Detail & Related papers (2024-01-08T16:38:56Z) - Nearly Minimax Optimal Submodular Maximization with Bandit Feedback [12.28389976959093]
We minimize the learner's regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| = k$. In this work, we establish the first minimax lower bound for this setting, which scales like $\tilde{\Omega}\bigl(\min_{L \le k}\bigl(T^{2/3} + \sqrt{\binom{n}{k-L}\,T}\bigr)\bigr)$. For a slightly restricted algorithm class, we prove a stronger regret lower bound of $\tilde{\Omega}(\min_{L}\cdots)$
arXiv Detail & Related papers (2023-10-27T20:19:03Z) - Optimal Query Complexities for Dynamic Trace Estimation [59.032228008383484]
We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly.
We provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $\epsilon$ error with $\delta$ failure probability.
Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error even in the static setting, and (2) are the first unconditional lower bounds for dynamic trace estimation.
arXiv Detail & Related papers (2022-09-30T04:15:44Z) - A spectral least-squares-type method for heavy-tailed corrupted
regression with unknown covariance \& heterogeneous noise [2.019622939313173]
We revisit heavy-tailed corrupted least-squares linear regression, assuming a corrupted $n$-sized label-feature sample with at most $\epsilon n$ arbitrary outliers.
We propose a near-optimal computationally tractable estimator, based on the power method, assuming no knowledge of $(\Sigma,\Xi)$ nor the operator norm of $\Xi$.
arXiv Detail & Related papers (2022-09-06T23:37:31Z) - Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor
Decompositions [51.19236668224547]
We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions.
For tensor train decomposition, we give a bicriteria $(1 + \epsilon)$-approximation algorithm with a small bicriteria rank and $O(q \cdot \mathrm{nnz}(A))$ running time.
In addition, we extend our algorithm to tensor networks with arbitrary graphs.
arXiv Detail & Related papers (2022-07-15T11:55:09Z) - More Optimal Simulation of Universal Quantum Computers [0.0]
Worst-case sampling cost has plateaued at $\le (2+\sqrt{2})\,\xi_t\,\delta^{-1}$ in the limit that $t \rightarrow \infty$.
We reduce this prefactor 68-fold by a leading-order reduction in $t$ through correlated sampling.
arXiv Detail & Related papers (2022-02-02T19:00:03Z) - Self-training Converts Weak Learners to Strong Learners in Mixture
Models [86.7137362125503]
We show that a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$.
We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples.
arXiv Detail & Related papers (2021-06-25T17:59:16Z) - Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to
Minimax Optimization [133.53164856723782]
We propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box mini-optimization where only function values can be obtained.
Meanwhile, we propose an accelerated zeroth-order momentum descent (Acc-MDA) method for black-box minimax optimization, where only function values can be obtained.
In particular, our Acc-MDA can obtain a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ with a batch size $O(\kappa_y^4)$.
arXiv Detail & Related papers (2020-08-18T22:19:29Z) - Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed
Payoffs [35.988644745703645]
We analyze the linear bandits with heavy-tailed payoffs, where the payoffs admit finite $(1+\epsilon)$ moments.
We propose two novel algorithms which enjoy a sublinear regret bound of $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{1+\epsilon}})$.
arXiv Detail & Related papers (2020-04-28T13:01:38Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T},\ \sqrt{\mathcal{E}}\,T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.