Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization
- URL: http://arxiv.org/abs/2506.06606v1
- Date: Sat, 07 Jun 2025 00:47:07 GMT
- Title: Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization
- Authors: Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, Brian Bullins
- Abstract summary: We introduce a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, to handle non-Euclidean smooth optimization tasks. In addition to providing theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against popular methods.
- Score: 15.179519413549086
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep network training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and large language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods. Code can be found at https://github.com/xinyuluo8561/Stacey .
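For intuition on the $\ell_p$ steepest descent updates the abstract refers to, the sketch below shows a single (unaccelerated) $\ell_p$ steepest descent step, derived from the standard dual-norm characterization; the function name and signature are illustrative, not from the Stacey codebase, and Stacey itself additionally interpolates primal-dual iterate sequences on top of updates of this kind:

```python
import numpy as np

def lp_steepest_descent_step(x, g, lr, p):
    """One plain l_p steepest descent step (illustrative sketch, p > 1).

    Uses the dual-norm form: the steepest descent direction w.r.t. the
    l_p norm has components sign(g_i) * |g_i|^(q-1), scaled so that the
    step length matches the dual norm ||g||_q, where 1/p + 1/q = 1.
    """
    q = p / (p - 1.0)  # dual exponent
    gq = np.linalg.norm(g, ord=q)
    if gq == 0.0:
        return x
    direction = np.sign(g) * np.abs(g) ** (q - 1.0) * gq ** (2.0 - q)
    return x - lr * direction
```

With $p = 2$ this reduces to ordinary gradient descent ($x - \eta g$), while as $p \to \infty$ the update approaches sign descent, matching the SGD-vs-Lion contrast drawn in the abstract.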
Related papers
- Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
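As a point of reference for what this work generalizes, the classic gradient norm clipping rule can be sketched as follows; the function name and threshold parameter are illustrative, not taken from the paper:

```python
import math

def clipped_gradient_step(x, g, lr, max_norm):
    """Classic gradient norm clipping: rescale g when ||g||_2 exceeds max_norm."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = min(1.0, max_norm / norm) if norm > 0 else 0.0
    return [xi - lr * scale * gi for xi, gi in zip(x, g)]
```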
arXiv Detail & Related papers (2025-06-02T17:34:29Z) - Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis [0.0]
We present AdaFisher, a novel adaptive second-order optimizer for deep neural networks (DNNs). AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training. We demonstrate that AdaFisher outperforms state-of-the-art approaches in both accuracy and convergence speed.
arXiv Detail & Related papers (2025-04-26T05:02:21Z) - FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization [9.909119107223265]
First-order and second-order methods suit entirely different situations. This paper presents a novel method that leverages both first-order and second-order methods in a unified algorithmic framework. FUSE-PV stands as a simple yet efficient optimization method involving a switch-over between first- and second-order updates.
arXiv Detail & Related papers (2025-03-06T08:30:18Z) - $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [54.94545757220999]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z) - Faster Acceleration for Steepest Descent [6.972653925522813]
We propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. Our method provides an iteration improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle.
arXiv Detail & Related papers (2024-09-28T01:21:03Z) - Fast Nonlinear Two-Time-Scale Stochastic Approximation: Achieving $O(1/k)$ Finite-Sample Complexity [2.5382095320488665]
This paper develops a new variant of two-time-scale stochastic approximation to find the roots of two coupled nonlinear operators.
Our key idea is to leverage the classic Ruppert-Polyak averaging technique to dynamically estimate the operators through their samples.
The estimated values of these averaging steps will then be used in the two-time-scale approximation updates to find the desired solution.
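The Ruppert-Polyak averaging technique mentioned above is simply a running mean of the iterates; a minimal sketch (names are illustrative) using the standard incremental-mean update:

```python
def polyak_average(iterates):
    """Ruppert-Polyak averaging: incremental running mean of the iterates.

    avg_k = avg_{k-1} + (x_k - avg_{k-1}) / k, equivalent to the plain
    arithmetic mean of x_1, ..., x_k but computable online.
    """
    avg = None
    for k, x in enumerate(iterates, start=1):
        avg = x if avg is None else avg + (x - avg) / k
    return avg
```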
arXiv Detail & Related papers (2024-01-23T13:44:15Z) - A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z) - SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates [19.420605210427635]
SketchySGD improves upon existing gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian.
We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum.
In the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems.
arXiv Detail & Related papers (2022-11-16T01:05:41Z) - Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [134.83964935755964]
In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials. To relieve this issue and consistently improve the training speed of deep networks, we propose Adan, an ADAptive Nesterov momentum algorithm.
arXiv Detail & Related papers (2022-08-13T16:04:39Z) - projUNN: efficient method for training deep networks with unitary matrices [21.11571804661279]
We introduce two variants of a method that can parameterize full $N$-dimensional unitary or orthogonal matrices with a training runtime scaling as $O(kN^2)$.
Even in the fastest setting, projUNN is able to train a model's unitary parameters to reach comparable performances against baseline implementations.
arXiv Detail & Related papers (2022-03-10T17:04:41Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
We study adversary-resilient distributed optimization, in which machines can independently compute gradients and cooperate to jointly optimize over their local data. Our algorithm is based on a new concentration technique, and we analyze its sample complexity. It is very practical: it improves upon the performance of all prior methods even when no Byzantine machines are present.
arXiv Detail & Related papers (2020-12-28T17:19:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.