A Unified Theory of Decentralized SGD with Changing Topology and Local
Updates
- URL: http://arxiv.org/abs/2003.10422v3
- Date: Tue, 2 Mar 2021 14:07:36 GMT
- Title: A Unified Theory of Decentralized SGD with Changing Topology and Local
Updates
- Authors: Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi,
Sebastian U. Stich
- Abstract summary: We introduce a unified convergence analysis of decentralized SGD methods.
We derive universal convergence rates for smooth (convex and non-convex) problems.
Our proofs rely on weak assumptions.
- Score: 70.9701218475002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized stochastic optimization methods have gained a lot of attention
recently, mainly because of their cheap per-iteration cost, data locality, and
their communication efficiency. In this paper we introduce a unified
convergence analysis that covers a large variety of decentralized SGD methods
which so far have required different intuitions, have different applications,
and which have been developed separately in various communities.
Our algorithmic framework covers local SGD updates and synchronous and
pairwise gossip updates on adaptive network topology. We derive universal
convergence rates for smooth (convex and non-convex) problems and the rates
interpolate between the heterogeneous (non-identically distributed data) and
iid-data settings, recovering linear convergence rates in many special cases,
for instance for over-parametrized models. Our proofs rely on weak assumptions
(typically improving over prior work in several aspects) and recover (and
improve) the best known complexity results for a host of important scenarios,
such as cooperative SGD and federated averaging (local SGD).
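As a rough illustration of the framework described in the abstract (nodes alternate local SGD steps with gossip averaging over a mixing matrix that may change between rounds), here is a minimal simulation sketch. It is not the paper's code or exact algorithm; the quadratic local objectives, the alternating ring topology, the noise level, and all hyper-parameters are illustrative assumptions.
```python
# Minimal sketch (not the paper's reference implementation): decentralized SGD
# where every node alternates local SGD steps with synchronous gossip averaging
# over a doubly-stochastic mixing matrix W_t that changes between rounds.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, tau, rounds, lr = 8, 5, 4, 200, 0.05

# Illustrative heterogeneous local objectives f_i(x) = 0.5 * ||x - b_i||^2
# (stand-ins for the nodes' local losses; stochastic gradients add noise).
b = rng.normal(size=(n_nodes, dim))

def stochastic_grad(i, x):
    return (x - b[i]) + 0.1 * rng.normal(size=dim)  # noisy gradient of f_i

def gossip_matrix(t):
    # Time-varying topology: a ring whose neighbor offset alternates between
    # rounds, with symmetric, doubly-stochastic weights. With offset 2 the
    # graph is momentarily disconnected, but the union over time is connected.
    W = np.eye(n_nodes) * 0.5
    offset = 1 if t % 2 == 0 else 2
    for i in range(n_nodes):
        j = (i + offset) % n_nodes
        W[i, j] += 0.25
        W[j, i] += 0.25
    return W

x = np.zeros((n_nodes, dim))             # one iterate per node
for t in range(rounds):
    for _ in range(tau):                 # tau local SGD steps per round
        for i in range(n_nodes):
            x[i] -= lr * stochastic_grad(i, x[i])
    x = gossip_matrix(t) @ x             # synchronous gossip averaging

print("consensus distance:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - b.mean(axis=0)))
```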
Related papers
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice.
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z)
- Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm [80.94861441583275]
We investigate the generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm.
Our results analyze the impact of different topologies on the generalization of D-SGDA.
We also balance it against the optimization error to obtain the optimal population risk in the convex-concave setting.
arXiv Detail & Related papers (2023-10-31T11:27:01Z)
- Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods [73.35353358543507]
Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequality problems (VIPs).
In this paper, we propose a unified convergence analysis that covers a large variety of descent-ascent methods.
We develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA).
arXiv Detail & Related papers (2022-02-15T09:17:39Z)
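For readers unfamiliar with the base method in the entry above, here is a minimal sketch of the plain SGDA update on a toy convex-concave saddle-point problem. It is not one of the paper's proposed variants (L-SVRGDA, QSGDA, DIANA-SGDA, VR-DIANA-SGDA, SEGA-SGDA); the objective, coupling matrix, noise level, and step size are illustrative assumptions.
```python
# Minimal sketch of plain SGDA: simultaneous stochastic gradient descent in x
# and ascent in y on the toy saddle-point problem
# f(x, y) = 0.5||x||^2 + x^T A y - 0.5||y||^2, whose saddle point is (0, 0).
import numpy as np

rng = np.random.default_rng(1)
dim, steps, lr = 4, 2000, 0.05
A = rng.normal(size=(dim, dim)) / np.sqrt(dim)   # illustrative coupling matrix

x, y = rng.normal(size=dim), rng.normal(size=dim)
for _ in range(steps):
    noise_x, noise_y = 0.1 * rng.normal(size=dim), 0.1 * rng.normal(size=dim)
    grad_x = x + A @ y + noise_x                 # stochastic grad of f w.r.t. x
    grad_y = A.T @ x - y + noise_y               # stochastic grad of f w.r.t. y
    x, y = x - lr * grad_x, y + lr * grad_y      # descent in x, ascent in y

print("||x||, ||y|| near the saddle point (0, 0):",
      np.linalg.norm(x), np.linalg.norm(y))
```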
- Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning [58.79085525115987]
Local methods are one of the promising approaches to reduce communication time.
We show that the communication complexity is better than that of non-local methods when the heterogeneity of the local datasets is smaller than the smoothness of the local loss.
arXiv Detail & Related papers (2022-02-12T15:12:17Z)
- On the Convergence of Heterogeneous Federated Learning with Arbitrary Adaptive Online Model Pruning [15.300983585090794]
We present a unifying framework for heterogeneous FL algorithms with arbitrary adaptive online model pruning.
In particular, we prove that under certain sufficient conditions, these algorithms converge to a stationary point of standard FL for general smooth cost functions.
We illuminate two key factors impacting convergence: pruning-induced noise and minimum coverage index.
arXiv Detail & Related papers (2022-01-27T20:43:38Z)
- Decentralized Local Stochastic Extra-Gradient for Variational Inequalities [125.62877849447729]
We consider distributed variational inequalities (VIs) on domains with the problem data that is heterogeneous (non-IID) and distributed across many devices.
We make a very general assumption on the computational network that covers the settings of fully decentralized calculations.
We theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone settings.
arXiv Detail & Related papers (2021-06-15T17:45:51Z)
- Local Stochastic Gradient Descent Ascent: Convergence Analysis and Communication Efficiency [15.04034188283642]
Local SGD is a promising approach to overcome the communication overhead in distributed learning.
We show that local SGDA can provably optimize distributed minimax problems in both the homogeneous and heterogeneous data settings.
arXiv Detail & Related papers (2021-02-25T20:15:18Z)
- Local SGD: Unified Theory and New Efficient Methods [8.701566919381223]
We present a unified framework for analyzing local SGD methods in the convex and strongly convex regimes.
We develop the first linearly converging local SGD method which does not require any data homogeneity or other strong assumptions.
arXiv Detail & Related papers (2020-11-03T13:02:50Z)
- Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
arXiv Detail & Related papers (2020-02-20T15:00:54Z)
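The setup in the Dynamic Federated Learning entry above (at every iteration a random subset of agents performs local updates, followed by server-side averaging) can be sketched as follows. This is not the paper's algorithm or analysis; the stationary quadratic agent objectives, the participation probability, and the step sizes are illustrative assumptions, and the paper's non-stationary random-walk minimizer is not modeled here.
```python
# Minimal sketch of federated averaging with random agent participation:
# sampled agents run a few local SGD steps and the server averages their models.
import numpy as np

rng = np.random.default_rng(2)
n_agents, dim, local_steps, rounds, lr, participate = 20, 5, 3, 300, 0.1, 0.3

# Heterogeneous agent objectives f_i(w) = 0.5 * ||w - c_i||^2 (illustrative).
c = rng.normal(size=(n_agents, dim))
w_global = np.zeros(dim)

for _ in range(rounds):
    active = rng.random(n_agents) < participate    # random subset of agents
    if not active.any():
        continue
    updates = []
    for i in np.flatnonzero(active):
        w = w_global.copy()
        for _ in range(local_steps):               # local SGD on agent i
            w -= lr * ((w - c[i]) + 0.1 * rng.normal(size=dim))
        updates.append(w)
    w_global = np.mean(updates, axis=0)            # server-side averaging

print("distance to aggregate minimizer:",
      np.linalg.norm(w_global - c.mean(axis=0)))
```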