Removing Data Heterogeneity Influence Enhances Network Topology
Dependence of Decentralized SGD
- URL: http://arxiv.org/abs/2105.08023v1
- Date: Mon, 17 May 2021 17:16:52 GMT
- Title: Removing Data Heterogeneity Influence Enhances Network Topology
Dependence of Decentralized SGD
- Authors: Kun Yuan and Sulaiman A. Alghunaim
- Abstract summary: We study the non-asymptotic convergence property of the D$^2$/Exact-diffusion algorithm.
Compared with existing decentralized algorithms, D$^2$/Exact-diffusion is the least sensitive to network topology.
- Score: 15.112499553818953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider decentralized stochastic optimization problems in which a network of
agents, each owning a local cost function, cooperates to find a minimizer of the
global-averaged cost. A widely studied decentralized algorithm for this problem
is D-SGD in which each node applies a stochastic gradient descent step, then
averages its estimate with its neighbors. D-SGD is attractive due to its
efficient single-iteration communication and can achieve linear speedup in
convergence (in terms of the network size). However, D-SGD is very sensitive to
the network topology. For smooth objective functions, the transient stage
(which measures how fast the algorithm can reach the linear speedup stage) of
D-SGD is on the order of $O(n/(1-\beta)^2)$ and $O(n^3/(1-\beta)^4)$ for
strongly convex and generally convex cost functions, respectively, where
$1-\beta \in (0,1)$ is a topology-dependent quantity that approaches $0$ for a
large and sparse network. Hence, D-SGD suffers from slow convergence for large
and sparse networks.
In this work, we study the non-asymptotic convergence property of the
D$^2$/Exact-diffusion algorithm. By eliminating the influence of data
heterogeneity between nodes, D$^2$/Exact-diffusion is shown to have an enhanced
transient stage that is on the order of $O(n/(1-\beta))$ and
$O(n^3/(1-\beta)^2)$ for strongly convex and generally convex cost functions,
respectively. Moreover, we provide a lower bound of the transient stage of
D-SGD under homogeneous data distributions, which coincides with the transient
stage of D$^2$/Exact-diffusion in the strongly-convex setting. These results
show that removing the influence of data heterogeneity can ameliorate the
network topology dependence of D-SGD. Compared with the bounds for existing
decentralized algorithms, D$^2$/Exact-diffusion is the least sensitive to network
topology.
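To make the two updates concrete, the following is a minimal simulation sketch, not taken from the paper: it assumes a toy decentralized least-squares problem, a ring topology, and arbitrary step size and iteration count, and it implements the generic adapt-then-combine D-SGD recursion together with the standard Exact-diffusion correction step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, T = 8, 5, 0.05, 2000

# Heterogeneous local least-squares costs: f_i(x) = 0.5 * ||A_i x - b_i||^2
A = rng.normal(size=(n, 10, d))
b = rng.normal(size=(n, 10))

def stoch_grad(i, x):
    """One-sample stochastic gradient of f_i at x."""
    s = rng.integers(0, A.shape[1])
    return A[i, s] * (A[i, s] @ x - b[i, s])

# Doubly stochastic mixing matrix W for a ring: each node averages with its two neighbours
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
W_bar = (np.eye(n) + W) / 2.0  # mixing matrix used by Exact-diffusion

# D-SGD: local SGD step (adapt), then neighbourhood averaging (combine)
x = np.zeros((n, d))
for _ in range(T):
    psi = np.stack([x[i] - alpha * stoch_grad(i, x[i]) for i in range(n)])
    x = W @ psi

# D^2/Exact-diffusion: adapt, correct with the previous iterates, combine.
# The correction term (y - psi_prev) is what removes the data-heterogeneity bias.
y = np.zeros((n, d))
psi_prev = y.copy()
for _ in range(T):
    psi = np.stack([y[i] - alpha * stoch_grad(i, y[i]) for i in range(n)])
    phi = psi + y - psi_prev
    y = W_bar @ phi
    psi_prev = psi
```

By the bounds stated in the abstract, the payoff of the correction grows as the graph gets sparser: on a ring of $n$ nodes, $1-\beta = \Theta(1/n^2)$, so the strongly convex transient stage improves from $O(n/(1-\beta)^2) = O(n^5)$ for D-SGD to $O(n/(1-\beta)) = O(n^3)$ for D$^2$/Exact-diffusion.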
Related papers
- Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which requires neither partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any nonsmooth, homogeneous, L-shaped objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
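As background on why the KL term makes the min-max problem tractable (a standard derivation, not necessarily the exact formulation used in DR-DSGD): regularizing the worst-case mixture weights $p$ over the probability simplex $\Delta_n$ yields a closed-form log-sum-exp objective,
$$\max_{p \in \Delta_n} \Big\{ \sum_{i=1}^{n} p_i f_i(x) - \lambda\, \mathrm{KL}\big(p \,\|\, \tfrac{1}{n}\mathbf{1}\big) \Big\} = \lambda \log\Big( \frac{1}{n} \sum_{i=1}^{n} e^{f_i(x)/\lambda} \Big),$$
so the robust learning problem becomes an ordinary smooth minimization of a log-sum-exp of the local losses, which is amenable to decentralized stochastic gradient methods.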
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Topology-aware Generalization of Decentralized SGD [89.25765221779288]
This paper studies the stability and generalizability of decentralized stochastic gradient descent (D-SGD).
We prove that the generalizability of D-SGD has a positive correlation with connectivity in the initial training phase.
arXiv Detail & Related papers (2022-06-25T16:03:48Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{\mathrm{avg}}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - High-Dimensional Inference over Networks: Linear Convergence and
Statistical Guarantees [20.701475313495884]
We study sparse linear regression over a network of agents, modeled as an undirected graph with no server node.
We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm.
arXiv Detail & Related papers (2022-01-21T01:26:08Z) - Distributed Random Reshuffling over Networks [7.013052033764372]
A distributed random reshuffling (D-RR) algorithm is proposed to solve problems with smooth and convex objective functions.
In particular, for smooth and strongly convex objective functions, D-RR achieves an $\mathcal{O}(1/T^2)$ rate of convergence (where $T$ counts the number of epochs) in terms of the squared distance between the iterates and the global minimizer.
arXiv Detail & Related papers (2021-12-31T03:59:37Z) - Acceleration in Distributed Optimization Under Similarity [72.54787082152278]
We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes.
An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\,\log(1/\varepsilon)\big)$ communication steps.
This rate matches (up to poly-log factors), for the first time, lower communication complexity bounds for distributed gossip algorithms applied to the class of problems of interest.
arXiv Detail & Related papers (2021-10-24T04:03:00Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We show that a given suboptimality $\epsilon > 0$ is achieved over master/workers networks in $\Omega\big(\Delta\cdot\delta/\mu\cdot\log(1/\varepsilon)\big)$ rounds of communication.
We then propose algorithms matching the lower bounds over either type of network (up to log factors).
We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - Fast decentralized non-convex finite-sum optimization with recursive
variance reduction [19.540926205375857]
We describe a first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique.
In particular, in a big-data regime such that $n = O(N^{1/2}(1-\lambda)^3)$, this complexity reduces to $O(N^{1/2}L\epsilon^{-2})$, independent of the network topology.
In addition, we show that appropriate choices of the local minibatch size balance the trade-offs between the gradient and communication complexities.
arXiv Detail & Related papers (2020-08-17T15:51:32Z)
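For background on the variance-reduction component (the standard SARAH recursion, stated here as context rather than as the paper's exact estimator): each node maintains a recursive stochastic gradient estimate
$$v^{t} = \nabla f_{i_t}(x^{t}) - \nabla f_{i_t}(x^{t-1}) + v^{t-1},$$
refreshed periodically with a full local gradient; GT-SARAH combines such an estimator with gradient tracking (the "GT" part) so that the variance-reduced directions are also averaged across the network.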