Convergence Analysis of Decentralized ASGD
- URL: http://arxiv.org/abs/2309.03754v1
- Date: Thu, 7 Sep 2023 14:50:31 GMT
- Title: Convergence Analysis of Decentralized ASGD
- Authors: Mauro DL Tosi, Martin Theobald
- Abstract summary: We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, L-smooth objective function.
- Score: 1.8710230264817358
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the last decades, Stochastic Gradient Descent (SGD) has been intensively
studied by the Machine Learning community. Despite its versatility and
excellent performance, the optimization of large models via SGD is still a
time-consuming task. To reduce training time, it is common to distribute the
training process across multiple devices. Recently, it has been shown that
asynchronous SGD (ASGD) always converges faster than mini-batch
SGD. However, despite these improvements in the theoretical bounds, most ASGD
convergence-rate proofs still rely on a centralized parameter server, which is
prone to become a bottleneck when scaling out the gradient computations across
many distributed processes.
In this paper, we present a novel convergence-rate analysis for decentralized
and asynchronous SGD (DASGD) which does not require partial synchronization
among nodes nor restrictive network topologies. Specifically, we provide a
bound of $\mathcal{O}(\sigma\epsilon^{-2}) +
\mathcal{O}(QS_{avg}\epsilon^{-3/2}) + \mathcal{O}(S_{avg}\epsilon^{-1})$ for
the convergence rate of DASGD, where $S_{avg}$ is the average staleness between
models, $Q$ is a constant that bounds the norm of the gradients, and $\epsilon$
is a (small) error that is allowed within the bound. Furthermore, when
gradients are not bounded, we prove the convergence rate of DASGD to be
$\mathcal{O}(\sigma\epsilon^{-2}) +
\mathcal{O}(\sqrt{\hat{S}_{avg}\hat{S}_{max}}\epsilon^{-1})$, with
$\hat{S}_{max}$ and $\hat{S}_{avg}$ representing a loose version of the average
and maximum staleness, respectively. Our convergence proof holds for a fixed
stepsize and any non-convex, homogeneous, and L-smooth objective function. We
anticipate that our results will be of high relevance for the adoption of DASGD
by a broad community of researchers and developers.
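To make the decentralized and asynchronous setting concrete, below is a minimal, self-contained Python sketch (assuming NumPy) of a DASGD-style training loop: every worker keeps its own copy of the model, gradient steps may be computed on stale parameter snapshots, and models are mixed by pairwise gossip averaging with a neighbour instead of going through a central parameter server. The quadratic objective, the ring topology, the staleness model, and all constants are illustrative assumptions, not the paper's algorithm or experimental setup.

```python
# Illustrative sketch of decentralized asynchronous SGD (DASGD), not the
# authors' exact algorithm: no parameter server, no global barrier, gradients
# may be computed on stale snapshots, models mix via pairwise gossip.
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(x) = (1/2n) * ||A x - b||^2 (assumption).
n, d = 256, 10
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star + 0.1 * rng.normal(size=n)

def stochastic_grad(x, batch=16):
    """Mini-batch gradient of the least-squares loss at x."""
    idx = rng.integers(0, n, size=batch)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch

num_workers = 4
stepsize = 0.05                                  # fixed stepsize, as in the analysed setting
models = [np.zeros(d) for _ in range(num_workers)]
# Each worker keeps a short history of its parameters to emulate stale reads.
history = [deque([m.copy()], maxlen=8) for m in models]
# Ring topology: each worker can gossip with its two neighbours (assumption).
ring = {i: [(i - 1) % num_workers, (i + 1) % num_workers] for i in range(num_workers)}

for step in range(2000):
    w = int(rng.integers(num_workers))           # an arbitrary worker "wakes up" (asynchrony)
    # Staleness: the gradient is computed on a randomly delayed snapshot of
    # worker w's parameters, mimicking an update that was started earlier.
    stale_x = history[w][int(rng.integers(len(history[w])))]
    models[w] = models[w] - stepsize * stochastic_grad(stale_x)
    # Gossip: pairwise averaging with one random neighbour, no central server.
    nb = int(rng.choice(ring[w]))
    avg = 0.5 * (models[w] + models[nb])
    models[w], models[nb] = avg.copy(), avg.copy()
    history[w].append(models[w].copy())
    history[nb].append(models[nb].copy())

consensus = np.mean(models, axis=0)
print("distance of consensus model to x*:", np.linalg.norm(consensus - x_star))
```

In this toy run the consensus average still drifts toward the least-squares solution even though no two updates are ever globally synchronized; the bounds above quantify how the staleness terms $S_{avg}$, $\hat{S}_{avg}$, and $\hat{S}_{max}$ slow such progress down.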
Related papers
- MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
Through our analyses, we address certain myths and legends related to the nonconvergence of the gradients in SGD.
arXiv Detail & Related papers (2023-10-19T17:58:59Z) - Scaling up Stochastic Gradient Descent for Non-convex Optimisation [5.908471365011942]
We propose a novel approach for scaling up stochastic gradient descent to shared and distributed parallel computation.
By combining two strategies into a unified framework, DPSGD offers a better trade-off between convergence and parallel computation.
The potential gains of DPSGD are demonstrated on a machine learning problem (Latent Dirichlet Allocation inference) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C).
arXiv Detail & Related papers (2022-10-06T13:06:08Z) - Topology-aware Generalization of Decentralized SGD [89.25765221779288]
This paper studies the generalizability of decentralized stochastic gradient descent (D-SGD).
We prove that the generalizability of D-SGD has a positive correlation with connectivity in the initial training phase.
arXiv Detail & Related papers (2022-06-25T16:03:48Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$, where $\tau_{avg}$ denotes the average delay.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - On the Last Iterate Convergence of Momentum Methods [32.60494006038902]
We prove for the first time that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate suffers from an error.
We show that in the setting with convex and smooth functions, our new SGDM algorithm automatically converges at a rate of $O(\frac{\log T}{\sqrt{T}})$.
arXiv Detail & Related papers (2021-02-13T21:16:16Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a wide range of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same rate $O(d\ln(N)/\sqrt{N})$ when used with default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.