Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability
- URL: http://arxiv.org/abs/2511.19716v1
- Date: Mon, 24 Nov 2025 21:24:40 GMT
- Title: Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability
- Authors: Mitchell Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi
- Abstract summary: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the noise floor are governed by $\mathbf{M}$-dependent quantities. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
- Score: 38.75338802679837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
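To make the update concrete, below is a minimal sketch (not the authors' code) of preconditioned SGD, $x_{k+1} = x_k - \eta\,\mathbf{M}^{-1} g_k$, on a quadratic diagnostic of the kind the abstract mentions. The spectrum, damping, step size, and noise level are illustrative assumptions chosen only to exhibit the qualitative rate-versus-floor trade-off.

```python
# Minimal sketch (not the authors' code): preconditioned SGD on a quadratic
# diagnostic f(x) = 0.5 * x^T A x with additive gradient noise. The spectrum,
# damping, step size, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 50
eigs = np.logspace(-3, 0, d)      # ill-conditioned spectrum, condition number 1e3
A = np.diag(eigs)

def run(M_inv_diag, steps, lr=0.5, noise=1e-3):
    """Iterate x <- x - lr * M^{-1} (A x + xi) and record f(x) = 0.5 x^T A x."""
    x = rng.standard_normal(d)
    hist = []
    for _ in range(steps):
        g = A @ x + noise * rng.standard_normal(d)   # stochastic gradient
        x = x - lr * M_inv_diag * g                  # preconditioned SGD step
        hist.append(0.5 * x @ A @ x)
    return np.array(hist)

steps = 3000
plain = run(np.ones(d), steps)               # M = I: vanilla SGD
precond = run(1.0 / (eigs + 1e-2), steps)    # M = diag(A) + damping: diagonal, curvature-aware

for k in (300, steps):
    print(f"step {k:5d}:  plain SGD {plain[k - 1]:.2e}   preconditioned {precond[k - 1]:.2e}")
```

With these settings the preconditioned run typically reaches a low loss within a few hundred steps and then flattens, while plain SGD keeps decaying slowly; which floor is ultimately lower depends on how $\mathbf{M}^{-1}$ transforms the gradient noise. Balancing these two effects is exactly the design principle quoted above: choose $\mathbf{M}$ to improve local conditioning while attenuating noise.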
Related papers
- Co-optimization for Adaptive Conformal Prediction [9.881784717196675]
We propose a framework that learns prediction intervals by jointly optimizing a center $m(x)$ and a radius $h(x)$. Experiments on synthetic and real benchmarks demonstrate that CoCP yields consistently shorter intervals and achieves state-of-the-art conditional-coverage diagnostics.
arXiv Detail & Related papers (2026-03-02T10:43:19Z) - Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs [55.77845440440496]
Push-sum based decentralized communication enables optimization over directed communication networks, where information exchange may be asymmetric. We develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm. A key technical ingredient is an imbalance-aware generalization bound expressed through two quantities.
arXiv Detail & Related papers (2026-02-24T05:32:03Z) - Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise [17.899443444882888]
We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD). We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when they are unknown. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the preconditioner and the gradient estimates.
arXiv Detail & Related papers (2026-02-13T19:29:17Z) - SGD Convergence under Stepsize Shrinkage in Low-Precision Training [0.0]
Quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent converges. We show that this shrinkage replaces the usual stepsize $\mu_k$ with an effective stepsize $\mu_k q_k$. We prove that low-precision SGD still converges, but at a slower pace set by $q_{\min}$ and with a higher steady error level due to quantization effects.
arXiv Detail & Related papers (2025-08-10T02:25:48Z) - Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization [50.49466204159458]
We propose two novel estimators based on the idea of noise symmetrization. Compared to works assuming symmetric noise with finite moments, we provide a sharper analysis and improved rates.
arXiv Detail & Related papers (2025-07-12T00:31:13Z) - Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - Provable Complexity Improvement of AdaGrad over SGD: Upper and Lower Bounds in Stochastic Non-Convex Optimization [18.47705532817026]
Adaptive gradient methods are among the most successful neural network training algorithms. These methods are known to achieve better dimensional dependence than SGD in stochastic non-convex optimization. In this paper we introduce new assumptions on the smoothness structure of the objective and on the gradient noise variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider stochastic optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$, which is achievable when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Generalization Bounds for Label Noise Stochastic Gradient Descent [0.0]
We establish generalization error bounds for stochastic gradient descent (SGD) with label noise in non-convex settings.
Our analysis offers insights into the effect of label noise.
arXiv Detail & Related papers (2023-11-01T03:51:46Z) - PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates [17.777466668123886]
We introduce PROMISE ($\textbf{Pr}$econditioned $\textbf{O}$ptimization $\textbf{M}$ethods by $\textbf{I}$ncorporating $\textbf{S}$calable Curvature $\textbf{E}$stimates), a suite of sketching-based preconditioned gradient algorithms.
PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha.
arXiv Detail & Related papers (2023-09-05T07:49:10Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise and (ii) problem-dependent constants.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that, when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and that this rate is accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC and its applications [5.660384137948734]
We show that the proposed algorithm asymptotically converges to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z)