Related papers: On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

URL: http://arxiv.org/abs/2601.12238v2
Date: Wed, 21 Jan 2026 04:24:33 GMT
Title: On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization
Authors: Sharan Sahu, Cameron J. Hogan, Martin T. Wells,
Abstract summary: We analyze the tracking performance of Gradient Descent under uniform strong convexity and smoothness in varying stepsize regimes.<n>We show that momentum can substantially amplify drift-induced tracking error, with an explicit penalty on the tracking capability.<n>These results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments -- where the data distribution and optimal parameters drift over time -- remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak heavy-ball and Nesterov) under uniform strong convexity and smoothness in varying stepsize regimes. We derive finite-time bounds in expectation and with high probability for the tracking error, establishing a sharp decomposition into three components: a transient initialization term, a noise-induced variance term, and a drift-induced tracking lag. Crucially, our analysis uncovers a fundamental trade-off: while momentum can suppress gradient noise, it incurs an explicit penalty on the tracking capability. We show that momentum can substantially amplify drift-induced tracking error, with amplification that becomes unbounded as the momentum parameter approaches one, formalizing the intuition that using 'stale' gradients hinders adaptation to rapid regime shifts. Complementing these upper bounds, we establish minimax lower bounds for dynamic regret under gradient-variation constraints. These lower bounds prove that the inertia-induced penalty is not an artifact of analysis but an information-theoretic barrier: in drift-dominated regimes, momentum creates an unavoidable 'inertia window' that fundamentally degrades performance. Collectively, these results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments and delineate the precise regime boundaries where SGD provably outperforms its accelerated counterparts.

Related papers

When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift [64.37959940809633]
We study robustness of Proximal Policy Optimization (PPO) under temporally persistent sensor failures.<n>We show Transformer-based sequence policies substantially outperform, RNN, and SSMs in robustness, maintaining high returns even when large fractions of sensors are unavailable.
arXiv Detail & Related papers (2026-03-04T22:21:54Z)
GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler [54.10960908347221]
We model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS)<n>GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen.
arXiv Detail & Related papers (2026-02-15T09:57:47Z)
Geometry of Drifting MDPs with Path-Integral Stability Certificates [14.721539799090904]
Real-world reinforcement learning is often emphnonstationary: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action.<n>We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point.<n>This yields a length--curvature--kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness.
arXiv Detail & Related papers (2026-01-29T17:03:23Z)
Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations [57.179679246370114]
We identify the distribution of random perturbations that minimizes the estimator's variance as the perturbation stepsize tends to zero.<n>Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length.
arXiv Detail & Related papers (2025-10-22T19:06:39Z)
Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
contexts drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns.<n>Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics.<n>We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z)
Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent [0.6906005491572401]
In neural deep networks, gradient descent (SGD) with momentum is said to converge faster and have better generalizability than SGD without momentum.<n>In particular, adding momentum is thought to reduce this batch noise.<n>We analyzed the effect of search direction noise, which is noise defined as the error between the search direction and the steepest descent direction.
arXiv Detail & Related papers (2024-02-04T02:48:28Z)
The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise regimes. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z)
Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
Losing momentum in continuous-time stochastic optimisation [42.617042045455506]
momentum-based optimisation algorithms have become particularly widespread. In this work, we analyse a continuous-time model for gradient descent with momentum. We also train a convolutional neural network in an image classification problem.
arXiv Detail & Related papers (2022-09-08T10:46:05Z)
Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning. GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients. This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We show restrained numerical instabilities in current training practices of deep networks with gradient descent (SGD) We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs) We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z)
Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
gradient descent (SGD) is relatively well understood in the vanishing learning rate regime. We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.