Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients
- URL: http://arxiv.org/abs/2512.02342v1
- Date: Tue, 02 Dec 2025 02:24:32 GMT
- Title: Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients
- Authors: Dimitris Oikonomou, Nicolas Loizou
- Abstract summary: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD). In this work, we provide rigorous convergence guarantees for non-smooth optimization with no need for strong assumptions. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory.
- Score: 16.39606116102731
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size accelerates convergence, reduces variance, and consistently outperforms existing adaptive baselines. Finally, in the context of deep neural network training, our method demonstrates robust performance by addressing the vanishing gradient problem.
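The abstract does not spell out the exact safeguard, but the core idea of capping a Polyak-type step so it stays bounded when subgradients shrink can be sketched as follows. This is a minimal sketch: the names `lb`, `c`, and `gamma_max` and the min-based cap are assumptions, not necessarily the paper's exact SPS$_{safe}$ rule.

```python
import numpy as np

def sps_safe_step(x, subgrad_fn, loss_fn, lb=0.0, c=1.0, gamma_max=1.0, eps=1e-12):
    """One stochastic subgradient step with a capped (safeguarded) Polyak step size.

    lb is a lower bound on the sampled loss (e.g., 0 for hinge loss); the cap
    gamma_max keeps the step bounded even when the subgradient is tiny.
    """
    g = subgrad_fn(x)                       # stochastic subgradient at x
    fx = loss_fn(x)                         # sampled loss value
    polyak = (fx - lb) / (c * np.dot(g, g) + eps)
    gamma = min(polyak, gamma_max)          # safeguard: never exceed gamma_max
    return x - gamma * g
```

The Polyak ratio explodes when the loss gap is large relative to a near-zero subgradient, which is exactly the small-(sub)gradient failure mode in the title; the cap keeps the step finite. The momentum variant discussed in the abstract would add a heavy-ball term to the last line.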
Related papers
- Closed-Form Last Layer Optimization [72.49151473937319]
Under a squared loss, the optimal solution to the linear last layer weights is known in closed form. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer.
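A minimal sketch of the closed-form piece under a squared loss; the ridge term is an assumption added for numerical stability, not something the summary specifies.

```python
import numpy as np

def closed_form_last_layer(H, Y, ridge=1e-6):
    """Solve min_W ||H @ W - Y||^2 + ridge * ||W||^2 in closed form.

    H: (n, d) backbone features, Y: (n, k) targets.
    """
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + ridge * np.eye(d), H.T @ Y)
```

Training then alternates gradient steps on the backbone (with the last layer fixed) and this exact refresh of the last-layer weights.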
arXiv Detail & Related papers (2025-10-06T09:14:39Z) - Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes [3.246129789918632]
The training of deep neural networks is inherently a non-convex optimization problem. Standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters. We propose a novel method, Stochastic Alternating Minimization with Trainable step sizes (SAMT). SAMT achieves better performance with fewer parameter updates compared to state-of-the-art methods.
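The summary does not describe how the step sizes are trained, so the sketch below only illustrates the alternating-block structure with a per-block step size; the sign-agreement adaptation rule is an assumption and SAMT's actual trainable rule may differ.

```python
import numpy as np

def alternating_epoch(blocks, grads_fn, steps, meta_lr=0.01):
    """One sweep of alternating block updates, each block with its own step size.

    blocks: list of parameter arrays; grads_fn(blocks, i) returns the gradient
    w.r.t. block i only; steps: per-block step sizes adapted on the fly.
    """
    for i in range(len(blocks)):
        g = grads_fn(blocks, i)
        blocks[i] = blocks[i] - steps[i] * g
        g_new = grads_fn(blocks, i)
        # grow the step while successive gradients agree, shrink otherwise
        steps[i] *= 1.0 + meta_lr * np.sign(np.dot(g.ravel(), g_new.ravel()))
    return blocks, steps
```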
arXiv Detail & Related papers (2025-08-06T08:23:38Z) - Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
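One standard way to obtain gradient-variation regret is optimistic online gradient descent, which plays each round using the previous gradient as a hint. A minimal sketch with a fixed step size `eta` and an l2-ball domain, both assumptions; the paper's point is handling generalized smoothness, which would require adaptive step sizes.

```python
import numpy as np

def optimistic_ogd(grad_fns, x0, eta=0.1, radius=1.0):
    """Optimistic OGD; regret scales with sum_t ||g_t - g_{t-1}||^2 when losses
    are smooth and convex. grad_fns: iterable of callables x -> gradient."""
    def project(z):                          # projection onto an l2 ball
        n = np.linalg.norm(z)
        return z if n <= radius else z * (radius / n)

    y = x0.copy()                            # secondary (lazy) sequence
    hint = np.zeros_like(x0)
    iterates = []
    for grad in grad_fns:
        x = project(y - eta * hint)          # play the optimistic iterate
        g = grad(x)
        y = project(y - eta * g)             # update the lazy sequence
        hint = g                             # next round's hint
        iterates.append(x)
    return iterates
```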
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Locally Adaptive Federated Learning [30.19411641685853]
Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model.
Standard federated optimization methods such as Federated Averaging (FedAvg) ensure generalization among the clients.
We propose locally adaptive federated learning algorithms that leverage the local geometric information of each client's function.
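A hedged sketch of what "locally adaptive" can look like: each client runs local steps with its own AdaGrad-norm step size built from its own gradients, and the server averages the resulting updates. The specific per-client rule here is an assumption, one way to use local geometry, not necessarily the paper's.

```python
import numpy as np

def fedavg_locally_adaptive(clients, x, rounds=10, local_steps=5, eps=1e-12):
    """FedAvg-style training where each client picks its own step size.

    clients: list of callables grad_i(z) returning a stochastic gradient.
    """
    for _ in range(rounds):
        updates = []
        for grad in clients:
            z, acc = x.copy(), 0.0
            for _ in range(local_steps):
                g = grad(z)
                acc += np.dot(g, g)                  # local gradient energy
                z = z - g / (np.sqrt(acc) + eps)     # client-specific step
            updates.append(z - x)
        x = x + np.mean(updates, axis=0)             # server averages deltas
    return x
```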
arXiv Detail & Related papers (2023-07-12T17:02:32Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on the confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
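The clipping primitive these stepsize rules build on is simple to state. The sketch below shows the operator itself; the paper's contribution lies in how `step` and `clip_level` are scheduled against the horizon and confidence level, which is not reproduced here.

```python
import numpy as np

def clipped_sgd_step(x, grad_fn, step, clip_level):
    """One SGD step with norm clipping, the core primitive behind
    high-probability bounds under heavy-tailed gradient noise."""
    g = grad_fn(x)
    norm = np.linalg.norm(g)
    if norm > clip_level:
        g = g * (clip_level / norm)   # rescale the norm, keep the direction
    return x - step * g
```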
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - The Role of Momentum Parameters in the Optimal Convergence of Adaptive
Polyak's Heavy-ball Methods [12.93796690939018]
We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(1/\sqrt{t})$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
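A minimal heavy-ball sketch in the $O(1/\sqrt{t})$ step-size regime the rate refers to; the fixed momentum weight `beta` is an assumption, whereas the paper analyzes time-varying weights.

```python
import numpy as np

def heavy_ball(grad_fn, x0, T, beta=0.9):
    """Heavy-ball iteration: x_{t+1} = x_t - alpha_t * g_t + beta * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for t in range(1, T + 1):
        alpha = 1.0 / np.sqrt(t)                     # time-varying step size
        x_next = x - alpha * grad_fn(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```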
arXiv Detail & Related papers (2021-02-15T02:57:14Z) - On The Verification of Neural ODEs with Stochastic Guarantees [14.490826225393096]
We show that Neural ODEs, an emerging class of time-continuous neural networks, can be verified by solving a set of global-optimization problems.
We introduce Stochastic Lagrangian Reachability (SLR), an abstraction-based technique for constructing a tight Reachtube.
arXiv Detail & Related papers (2020-12-16T11:04:34Z) - The Strength of Nesterov's Extrapolation in the Individual Convergence
of Nonsmooth Optimization [0.0]
We prove that Nesterov's extrapolation has the strength to make the individual convergence of gradient descent methods optimal for nonsmooth problems.
We give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings.
Our method is applicable as an efficient tool for solving large-scale $\ell_1$-regularized hinge-loss learning problems.
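A hedged sketch of that setting: a subgradient method with Nesterov-style extrapolation applied to the $\ell_1$-regularized hinge loss. The step and extrapolation schedules below are standard textbook choices, not necessarily the paper's.

```python
import numpy as np

def hinge_l1_subgrad(x, A, b, lam):
    """Subgradient of (1/n) sum_i max(0, 1 - b_i <a_i, x>) + lam * ||x||_1."""
    margins = b * (A @ x)
    active = margins < 1.0                            # margin-violating examples
    g = -(A[active].T @ b[active]) / len(b)
    return g + lam * np.sign(x)

def extrapolated_subgradient(A, b, lam=1e-3, T=1000):
    """Subgradient method with Nesterov-style extrapolation."""
    x_prev = x = np.zeros(A.shape[1])
    for t in range(1, T + 1):
        beta = (t - 1) / (t + 2)                      # extrapolation weight
        y = x + beta * (x - x_prev)                   # look-ahead point
        x_prev, x = x, y - (1.0 / np.sqrt(t)) * hinge_l1_subgrad(y, A, b, lam)
    return x
```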
arXiv Detail & Related papers (2020-06-08T03:35:41Z) - Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order
Optimization [10.907491258280608]
Interest in stochastic zeroth-order (SZO) methods has recently been revived in black-box optimization scenarios such as adversarial black-box attacks on deep neural networks.
SZO methods only require the ability to evaluate the objective function at random input points.
We present an SZO optimization method that reduces the dependency on the dimensionality of the random perturbations to be evaluated.
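One concrete way to shrink the perturbation's effective dimension is to estimate only `k` randomly chosen partial derivatives per iteration with central differences, at a cost of `2k` function evaluations instead of a cost tied to the full dimension. This is an illustrative estimator under that assumption; the paper's sparse-perturbation scheme may scale terms differently.

```python
import numpy as np

def sparse_zo_gradient(f, x, k=10, mu=1e-4, rng=None):
    """Two-point zeroth-order gradient estimate on k random coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    idx = rng.choice(d, size=min(k, d), replace=False)
    g = np.zeros(d)
    for i in idx:
        e = np.zeros(d)
        e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2.0 * mu)    # central difference
    return g
```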
arXiv Detail & Related papers (2020-06-02T16:39:37Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity bounds that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without the light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z) - Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization [71.03797261151605]
Adaptivity is an important yet under-studied property in modern optimization theory.
Our algorithm provably achieves the best-available convergence rate for non-PL objectives while simultaneously outperforming existing algorithms for PL objectives.
arXiv Detail & Related papers (2020-02-13T05:42:27Z)