Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex
Optimization
- URL: http://arxiv.org/abs/2310.15976v2
- Date: Thu, 28 Dec 2023 00:59:10 GMT
- Title: Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex
Optimization
- Authors: Zhen Qin, Zhishuai Liu, Pan Xu
- Abstract summary: Existing analyses of signSGD rely on assuming that data are sampled with replacement in each iteration.
We show that SignRR has the same convergence rate as signSGD, while SignRVR and SignRVM converge at $O(\log(nT)/\sqrt{nT} + \log(nT)\sqrt{n}/\sqrt{T})$.
We back up our theoretical findings through experiments on simulated and real-world problems.
- Score: 14.69450310695133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: signSGD is popular in nonconvex optimization due to its communication
efficiency. Yet, existing analyses of signSGD rely on assuming that data are
sampled with replacement in each iteration, contradicting the practical
implementation where data are randomly reshuffled and sequentially fed into the
algorithm. We bridge this gap by proving the first convergence result of
signSGD with random reshuffling (SignRR) for nonconvex optimization. Given the
dataset size $n$, the number of epochs of data passes $T$, and the variance
bound of a stochastic gradient $\sigma^2$, we show that SignRR has the same
convergence rate $O(\log(nT)/\sqrt{nT} + \|\sigma\|_1)$ as signSGD
\citep{bernstein2018signsgd}. We then present SignRVR and SignRVM, which
leverage variance-reduced gradients and momentum updates respectively, both
converging at $O(\log (nT)/\sqrt{nT} + \log (nT)\sqrt{n}/\sqrt{T})$. In
contrast with the analysis of signSGD, our results do not require an extremely
large batch size in each iteration to be of the same order as the total number
of iterations \citep{bernstein2018signsgd}, or that the signs of stochastic and
true gradients match element-wise with a minimum probability of 1/2
\citep{safaryan2021stochastic}. We also extend our algorithms to cases where
data are distributed across different machines, yielding dist-SignRVR and
dist-SignRVM, both converging at $O(\log (n_0T)/\sqrt{n_0T} + \log
(n_0T)\sqrt{n_0}/\sqrt{T})$, where $n_0$ is the dataset size of a single
machine. We back up our theoretical findings through experiments on simulated
and real-world problems, verifying that randomly reshuffled sign methods match
or surpass existing baselines.
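
To make the random-reshuffling setting concrete, below is a minimal, hedged Python sketch of a sign-based update applied to data that are permuted once per epoch and fed sequentially, in the spirit of SignRR described above. The constant step size, the optional momentum average (loosely echoing SignRVM), and the toy least-squares problem are illustrative assumptions, not the paper's pseudocode or tuned hyperparameters.

```python
# Illustrative sketch (not the paper's pseudocode): sign-based updates over
# randomly reshuffled data, i.e., each epoch permutes the n samples once and
# feeds them sequentially instead of sampling with replacement.
import numpy as np

def sign_rr(grad_fn, x0, n, num_epochs=10, lr=0.01, momentum=None, seed=0):
    """grad_fn(x, i) returns the gradient of the i-th sample's loss at x.
    If `momentum` is set (e.g. 0.9), the sign of an exponential moving
    average of gradients is used instead (a SignRVM-like heuristic)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(num_epochs):
        for i in rng.permutation(n):        # reshuffle once per epoch
            g = grad_fn(x, i)
            if momentum is not None:
                m = momentum * m + (1.0 - momentum) * g
                direction = np.sign(m)
            else:
                direction = np.sign(g)      # only the sign is kept: 1 bit per coordinate
            x -= lr * direction
    return x

# Toy usage on a synthetic least-squares problem (hypothetical example).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200, 5
    A, b = rng.normal(size=(n, d)), rng.normal(size=n)
    grad_fn = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]
    x_hat = sign_rr(grad_fn, np.zeros(d), n, num_epochs=50, lr=0.005)
    print("final objective:", np.mean((A @ x_hat - b) ** 2))
```

The communication appeal of sign methods comes from transmitting only one bit per coordinate; the distributed variants in the abstract aggregate such signs across machines, which this single-machine sketch does not attempt.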
Related papers
- Variable Selection in Convex Piecewise Linear Regression [5.366354612549172]
This paper presents Sparse Gradient Descent (SpGD) as a solution for variable selection in convex piecewise linear regression.
A non-asymptotic local convergence analysis is provided for SpGD under sub-Gaussian noise.
arXiv Detail & Related papers (2024-11-04T16:19:09Z) - Inverting the Leverage Score Gradient: An Efficient Approximate Newton Method [10.742859956268655]
This paper aims to recover the intrinsic model parameters given the leverage scores gradient.
We specifically scrutinize the inversion of the leverage score gradient, denoted as $g(x)$.
arXiv Detail & Related papers (2024-08-21T01:39:42Z) - On the Convergence of a Federated Expectation-Maximization Algorithm [0.0]
We study the convergence rate of the Expectation-Maximization (EM) algorithm for the Federated Mixture of $K$ Linear Regressions model.
Surprisingly, the results show that rather than being a bottleneck, data heterogeneity can accelerate the convergence of Federated Learning algorithms.
arXiv Detail & Related papers (2024-08-11T16:46:42Z) - Efficient Sign-Based Optimization: Accelerating Convergence via Variance Reduction [16.82220229840038]
We introduce two novel algorithms that attain improved convergence rates of $\mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2})$ and $\mathcal{O}(d^{1/4}T^{-1/4})$, respectively.
Numerical experiments across different tasks validate the effectiveness of our proposed methods.
arXiv Detail & Related papers (2024-06-01T16:38:43Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation with workers that communicate at varying frequencies.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We derive lower bounds on the number of communication rounds needed to reach a given suboptimality $\epsilon > 0$ over master/workers networks.
We then propose algorithms matching the lower bounds for either type of network (up to log factors).
We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an optimization error at the order of the intrinsic excess error of a learning model.
For generic loss functions, we prove that HSDMPG attains an error bound of the same order.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - On the Minimax Optimality of the EM Algorithm for Learning Two-Component
Mixed Linear Regression [30.09238179576628]
We study the convergence rates of the EM algorithm for learning two-component mixed linear regression under all regimes of signal-to-noise ratio (SNR).
We show that the EM algorithm achieves minimax optimal sample complexity under all SNR regimes.
arXiv Detail & Related papers (2020-06-04T00:56:07Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z) - Tight Regret Bounds for Noisy Optimization of a Brownian Motion [118.65407541895851]
We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise.
We show that the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T \cdot \log T})$.
arXiv Detail & Related papers (2020-01-25T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.