The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity
- URL: http://arxiv.org/abs/2403.15654v2
- Date: Wed, 11 Sep 2024 05:11:33 GMT
- Title: The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity
- Authors: Tongle Wu, Ying Sun
- Abstract summary: We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates.
We demonstrate that incorporating $K > 1$ local update steps can reduce communication complexity.
- Score: 2.845817138242963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates. We consider two settings and demonstrate that incorporating $K > 1$ local update steps can reduce communication complexity. Specifically, for $\mu$-strongly convex and $L$-smooth loss functions, we prove that local DGT achieves communication complexity $\tilde{\mathcal{O}} \Big(\frac{L}{\mu K} + \frac{\delta}{\mu (1 - \rho)} + \frac{\rho }{(1 - \rho)^2} \cdot \frac{L+ \delta}{\mu}\Big)$, where $\rho$ measures the network connectivity and $\delta$ measures the second-order heterogeneity of the local losses. Our result reveals the tradeoff between communication and computation and shows that increasing $K$ can effectively reduce communication costs when the data heterogeneity is low and the network is well connected. We then consider the over-parameterization regime where the local losses share the same minima, and prove that employing local updates in DGD, even without gradient correction, can yield a similar effect as DGT in reducing communication complexity. Numerical experiments validate our theoretical results.
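The abstract describes running DGD/DGT with $K > 1$ local gradient steps between communication rounds. Below is a minimal sketch of DGD with local updates (the simpler of the two schemes, analyzed here in the over-parameterized regime), assuming quadratic local losses and a doubly stochastic mixing matrix; the function and variable names (`local_dgd`, `grads`, `W`) are illustrative and not the authors' implementation.

```python
import numpy as np

def local_dgd(grads, x0, W, K=5, rounds=100, lr=0.1):
    """DGD with K local updates: each agent runs K gradient steps on its
    own loss, then mixes its iterate with neighbors via the matrix W."""
    n = W.shape[0]                        # number of agents
    X = np.tile(x0, (n, 1))               # one parameter row per agent
    for _ in range(rounds):                # each outer loop = 1 communication
        for _ in range(K):                 # K local gradient steps
            G = np.stack([grads[i](X[i]) for i in range(n)])
            X = X - lr * G
        X = W @ X                          # single gossip (mixing) step
    return X.mean(axis=0)

# Toy run: agent i holds f_i(x) = ||x - c_i||^2 / 2, so grad f_i(x) = x - c_i,
# and the global minimizer is the average of the c_i.
rng = np.random.default_rng(0)
n, d = 4, 3
c = rng.normal(size=(n, d))
grads = [lambda x, ci=c[i]: x - ci for i in range(n)]
W = np.full((n, n), 1.0 / n)               # fully connected network: exact averaging
x_hat = local_dgd(grads, np.zeros(d), W, K=10, rounds=50)
print(np.linalg.norm(x_hat - c.mean(axis=0)))   # close to 0
```

Increasing `K` trades extra local computation for fewer gossip rounds, which is the communication/computation tradeoff quantified by the bound in the abstract.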
Related papers
- A Proximal Gradient Method With Probabilistic Multi-Gossip Communications for Decentralized Composite Optimization [36.777745196161035]
We propose a communication-efficient method MG-Skip for decentralized composite (smooth + nonsmooth) optimization.
We show that MG-Skip achieves the optimal communication complexity and confirm the benefits of local updates in the nonsmooth setup.
arXiv Detail & Related papers (2023-12-19T05:13:16Z) - DFedADMM: Dual Constraints Controlled Model Inconsistency for
Decentralized Federated Learning [52.83811558753284]
Decentralized federated learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - On the Performance of Gradient Tracking with Local Updates [10.14746251086461]
We show that LU-GT has the same communication complexity but allows arbitrary network topologies.
Numerical examples reveal that local updates can lower communication costs in certain regimes.
arXiv Detail & Related papers (2022-10-10T15:13:23Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem (the standard identity behind such a reduction is sketched after this list).
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD
for Communication Efficient Nonconvex Distributed Learning [58.79085525115987]
Local methods are one of the promising approaches to reduce communication time.
We show that the communication complexity is better than that of non-local methods when the heterogeneity of the local datasets is smaller than the smoothness of the local loss.
arXiv Detail & Related papers (2022-02-12T15:12:17Z) - Acceleration in Distributed Optimization Under Similarity [72.54787082152278]
We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes.
An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\,\log\frac{1}{\varepsilon}\big)$ communication steps.
This rate matches, for the first time (up to poly-log factors), the lower communication complexity bounds of distributed gossip algorithms applied to the class of problems of interest.
arXiv Detail & Related papers (2021-10-24T04:03:00Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We show that a given suboptimality $\epsilon > 0$ is achieved over master/workers networks in $\Omega\big(\cdot\big)$ rounds of communication.
We then propose algorithms matching the lower bounds over either type of network (up to log factors).
We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - Communication-efficient SGD: From Local SGD to One-Shot Averaging [16.00658606157781]
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers.
We suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows (a minimal sketch of such a schedule follows this list).
arXiv Detail & Related papers (2021-06-09T01:10:34Z) - Removing Data Heterogeneity Influence Enhances Network Topology
Dependence of Decentralized SGD [15.112499553818953]
We study the non-asymptotic convergence property of the D$^2$/Exact-diffusion algorithm.
Compared with existing decentralized algorithms, D$^2$/Exact-diffusion is least sensitive to network topology.
arXiv Detail & Related papers (2021-05-17T17:16:52Z) - Hogwild! over Distributed Local Data Sets with Linearly Increasing
Mini-Batch Sizes [26.9902939745173]
Hogwild! implements asynchronous Stochastic Gradient Descent (SGD), where multiple threads access a common repository containing training data in parallel.
We show how local compute nodes can start choosing small mini-batch sizes which increase to larger ones in order to reduce communication cost.
arXiv Detail & Related papers (2020-10-27T03:46:09Z)
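For the DR-DSGD entry above: the reduction it refers to is plausibly the standard identity for KL-regularized distributionally robust objectives (an illustration under that assumption, not necessarily the paper's exact formulation). Writing $\Delta_n$ for the probability simplex, $f_i(\theta)$ for the local losses, and $\lambda > 0$ for the regularization weight, $\max_{p \in \Delta_n} \sum_{i=1}^{n} p_i f_i(\theta) - \lambda\, \mathrm{KL}\big(p \,\|\, \tfrac{1}{n}\mathbf{1}\big) = \lambda \log\Big(\tfrac{1}{n}\sum_{i=1}^{n} \exp\big(f_i(\theta)/\lambda\big)\Big)$, so the robust min-max problem reduces to minimizing the smooth log-sum-exp surrogate on the right-hand side over $\theta$.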
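For the Local SGD entry above: one concrete way to "communicate less frequently as the number of iterations grows" is to let the interval between averaging steps expand over time. The sketch below uses an illustrative doubling schedule (so the number of averaging rounds grows like $\log T$ rather than $T$); the schedule, function names, and toy losses are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

def local_sgd_growing_intervals(grad_fns, x0, T=1000, lr=0.05, rng=None):
    """Local SGD where workers average their models only at iterations
    1, 2, 4, 8, ... (doubling schedule), so only O(log T) communication
    rounds are used over T iterations."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(grad_fns)
    X = np.tile(x0, (n, 1))              # one model per worker
    next_sync, comms = 1, 0
    for t in range(1, T + 1):
        for i in range(n):               # each worker takes one local SGD step
            X[i] -= lr * grad_fns[i](X[i], rng)
        if t >= next_sync:               # infrequent global averaging
            X[:] = X.mean(axis=0)
            comms += 1
            next_sync *= 2               # interval doubles after each round
    return X.mean(axis=0), comms

# Toy run: worker i sees noisy gradients of ||x - c_i||^2 / 2.
rng = np.random.default_rng(1)
c = rng.normal(size=(4, 3))
grad_fns = [lambda x, r, ci=c[i]: (x - ci) + 0.01 * r.normal(size=x.shape)
            for i in range(4)]
x_hat, comms = local_sgd_growing_intervals(grad_fns, np.zeros(3), T=1000)
print(comms)                             # about 10 averaging rounds, not 1000
```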
This list is automatically generated from the titles and abstracts of the papers on this site.