Improved Analysis for Sign-based Methods with Momentum Updates
- URL: http://arxiv.org/abs/2507.12091v1
- Date: Wed, 16 Jul 2025 09:54:08 GMT
- Title: Improved Analysis for Sign-based Methods with Momentum Updates
- Authors: Wei Jiang, Dingzhi Yu, Sifan Yang, Wenhao Yang, Lijun Zhang
- Abstract summary: We present enhanced analysis for sign-based optimization algorithms with momentum updates. We show that the proposed momentum-based method yields convergence rates of $\mathcal{O}\left( d^{1/2}T^{-1/2} + dn^{-1/2} \right)$ and $\mathcal{O}\left( \max \{ d^{1/4}T^{-1/4}, d^{1/10}T^{-1/5} \} \right)$.
- Score: 15.679979249491087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present enhanced analysis for sign-based optimization algorithms with momentum updates. Traditional sign-based methods, under the separable smoothness assumption, guarantee a convergence rate of $\mathcal{O}(T^{-1/4})$, but they either require large batch sizes or assume unimodal symmetric stochastic noise. To address these limitations, we demonstrate that signSGD with momentum can achieve the same convergence rate using constant batch sizes without additional assumptions. Our analysis, under the standard $l_2$-smoothness condition, improves upon the result of the prior momentum-based signSGD method by a factor of $\mathcal{O}(d^{1/2})$, where $d$ is the problem dimension. Furthermore, we explore sign-based methods with majority vote in distributed settings and show that the proposed momentum-based method yields convergence rates of $\mathcal{O}\left( d^{1/2}T^{-1/2} + dn^{-1/2} \right)$ and $\mathcal{O}\left( \max \{ d^{1/4}T^{-1/4}, d^{1/10}T^{-1/5} \} \right)$, which outperform the previous results of $\mathcal{O}\left( dT^{-1/4} + dn^{-1/2} \right)$ and $\mathcal{O}\left( d^{3/8}T^{-1/8} \right)$, respectively. Numerical experiments further validate the effectiveness of the proposed methods.
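For readers who want the mechanics, below is a minimal sketch of a momentum signSGD step together with a majority-vote variant for the distributed setting with $n$ workers. The exponential-moving-average momentum form and all names (`lr`, `beta`, `stochastic_grad`) are illustrative assumptions, not necessarily the paper's exact algorithm.

```python
import numpy as np

def signsgd_momentum_step(x, m, stochastic_grad, lr=0.01, beta=0.9):
    """One step of signSGD with momentum (generic sketch).

    x: current iterate; m: momentum buffer;
    stochastic_grad: callable returning a stochastic gradient at x.
    """
    g = stochastic_grad(x)
    m = beta * m + (1 - beta) * g   # exponential-moving-average momentum
    x = x - lr * np.sign(m)         # descend along the sign of the momentum
    return x, m

def majority_vote_step(x, momenta, worker_grads, lr=0.01, beta=0.9):
    """Distributed variant: each of n workers communicates only the sign of
    its momentum; the server aggregates by majority vote (sign of the sum)."""
    votes = np.zeros_like(x)
    for i, g in enumerate(worker_grads):
        momenta[i] = beta * momenta[i] + (1 - beta) * g
        votes += np.sign(momenta[i])  # one 1-bit vote per coordinate per worker
    return x - lr * np.sign(votes), momenta
```

Sign compression is what makes the majority-vote scheme communication-efficient: each worker transmits one bit per coordinate per round, regardless of the gradient magnitudes.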
Related papers
- Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $\mathcal{O}\left(\frac{1}{T^{1/4}}\right)$ rather than the previous $\mathcal{O}\left(\frac{\ln T}{T^{1/4}}\right)$. Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
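As context for the "sign-like descent" view, the following generic sketch of an Adam-style step (without bias correction, an assumption made here for brevity) shows why the update direction $m/\sqrt{v}$ behaves like $\mathrm{sign}(m)$ when $v$ tracks the elementwise square of $m$.

```python
import numpy as np

def adam_step(x, m, v, g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style step (generic sketch, no bias correction).

    When v approximates the elementwise square of m, the direction
    m / sqrt(v) approaches sign(m): the 'sign-like descent' view.
    """
    m = b1 * m + (1 - b1) * g            # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    x = x - lr * m / (np.sqrt(v) + eps)  # elementwise, sign-like direction
    return x, m, v
```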
arXiv Detail & Related papers (2025-07-08T13:19:26Z)
- Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm [31.539921770584005]
We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.
arXiv Detail & Related papers (2025-05-21T05:49:11Z) - Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness [74.18546828528298]
We show that SignSGD with Majority Voting can robustly work on the whole range of tail indices $\kappa \in (1, 2]$, with high-probability complexity bounds involving factors of $\frac{\kappa}{\kappa-1}$.
arXiv Detail & Related papers (2025-02-11T19:54:11Z)
- Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping [21.865728815935665]
We provide the first convergence guarantee under heavy-tailed noises without gradient clipping. We also establish the first $\mathcal{O}\left(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}}\right)$ convergence rate in the case where the tail index $\mathfrak{p}$ is unknown in advance.
arXiv Detail & Related papers (2024-12-27T08:46:46Z)
- Efficient Sign-Based Optimization: Accelerating Convergence via Variance Reduction [16.82220229840038]
We introduce two novel algorithms that attain improved convergence rates of $\mathcal{O}\left(d^{1/2}T^{-1/2} + dn^{-1/2}\right)$ and $\mathcal{O}\left(d^{1/4}T^{-1/4}\right)$, respectively. Numerical experiments across different tasks validate the effectiveness of our proposed methods.
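For intuition, the sketch below shows a STORM-style recursive variance-reduced estimator driving a sign update, which is one common way to realize variance reduction in sign-based methods; the estimator form and the names (`grad_pair`, `a`) are illustrative assumptions, not necessarily this paper's algorithm.

```python
import numpy as np

def sign_storm_step(x, x_prev, m, grad_pair, lr=0.01, a=0.1):
    """One sign step driven by a STORM-style variance-reduced estimator.

    grad_pair(x, x_prev) returns (g, g_prev): stochastic gradients at the
    current and previous iterates computed with the SAME random sample.
    """
    g, g_prev = grad_pair(x, x_prev)
    # Recursive variance-reduced momentum:
    #   m_t = g_t + (1 - a) * (m_{t-1} - g_{t-1})
    m = g + (1 - a) * (m - g_prev)
    x_next = x - lr * np.sign(m)   # sign update on the VR estimator
    return x_next, x, m
```

Reusing the same sample at two consecutive iterates is what lets the recursion cancel noise, so the sign is taken of a lower-variance estimate of the gradient.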
arXiv Detail & Related papers (2024-06-01T16:38:43Z)
- Mirror Descent Algorithms with Nearly Dimension-Independent Rates for Differentially-Private Stochastic Saddle-Point Problems [6.431793114484429]
We propose mirror descent algorithms to solve differentially-private stochastic saddle-point problems in the polyhedral setting. We show that our algorithms attain a rate of $\sqrt{\log(d)}/\sqrt{n} + \log(d)/[n\varepsilon]^{2/5}$ with constant success probability.
arXiv Detail & Related papers (2024-03-05T12:28:00Z)
- On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [54.28350823319057]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^T \mathbb{E}\left[\|\nabla f(x^k)\|_1\right] \leq O\left(\frac{\sqrt{d}}{T^{1/4}}\right)$. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^T \mathbb{E}\left[\|\nabla f(x^k)\|_2^2\right] \leq O\left(\frac{1}{\sqrt{T}}\right)$ rate of SGD measured by the $\ell_2$ norm.
arXiv Detail & Related papers (2024-02-01T07:21:32Z)
- Near Optimal Heteroscedastic Regression with Symbiotic Learning [29.16456701187538]
We consider the problem of heteroscedastic linear regression.
We can estimate $\mathbf{w}^*$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^*\|^2 \cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$ and prove a matching lower bound.
arXiv Detail & Related papers (2023-06-25T16:32:00Z)
- Mean-Square Analysis with An Application to Optimal Dimension Dependence of Langevin Monte Carlo [60.785586069299356]
This work provides a general framework for the non-asymptotic analysis of sampling error in the 2-Wasserstein distance.
Our theoretical analysis is further validated by numerical experiments.
arXiv Detail & Related papers (2021-09-08T18:00:05Z)
- Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n$, $d$, and $\delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
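To ground the quantities in that bound, here is a minimal sketch of tabular asynchronous Q-learning along a single Markovian trajectory: $\gamma$ is the discount factor, while $\mu_{\min}$ and $t_{\mathrm{mix}}$ refer to the minimum state-action occupancy probability and the mixing time of the behavior trajectory. The environment interface (`env.reset`, `env.step` returning a next state and reward), the epsilon-greedy behavior policy, and the constant step size are illustrative assumptions, not the variance-reduced algorithm from the paper.

```python
import numpy as np

def async_q_learning(env, num_states, num_actions, num_steps,
                     lr=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular asynchronous Q-learning: updates ONE (s, a) entry per step,
    following a single epsilon-greedy behavior trajectory."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # Epsilon-greedy action selection along the trajectory.
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)          # assumed interface: (next state, reward)
        # Asynchronous TD update of only the visited entry.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += lr * (target - Q[s, a])
        s = s_next
    return Q
```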
arXiv Detail & Related papers (2020-06-04T17:51:00Z)