Minibatch vs Local SGD for Heterogeneous Distributed Learning
- URL: http://arxiv.org/abs/2006.04735v5
- Date: Tue, 1 Mar 2022 09:30:34 GMT
- Title: Minibatch vs Local SGD for Heterogeneous Distributed Learning
- Authors: Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro
- Abstract summary: We argue that Minibatch SGD dominates all existing analysis of Local SGD in this setting.
We present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
- Score: 28.80878557506603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the
heterogeneous distributed setting, where each machine has access to stochastic
gradient estimates for a different, machine-specific, convex objective; the
goal is to optimize w.r.t. the average objective; and machines can only
communicate intermittently. We argue that, (i) Minibatch SGD (even without
acceleration) dominates all existing analysis of Local SGD in this setting,
(ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and
(iii) we present the first upper bound for Local SGD that improves over Minibatch
SGD in a non-homogeneous regime.
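To make the intermittent-communication setting concrete, here is a minimal NumPy sketch (not taken from the paper; the quadratic per-machine objectives, the function names, and all hyperparameters are illustrative assumptions). With M machines and K stochastic gradient computations per communication round, Minibatch SGD takes a single step per round on the average of the M*K gradients evaluated at the shared iterate, whereas Local SGD lets each machine run K local steps on its own objective and then averages the resulting iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, R, lr, dim = 4, 10, 50, 0.05, 5   # machines, local steps per round, rounds, step size, dimension

# Heterogeneous convex objectives (illustrative): machine m minimizes 0.5 * ||x - b[m]||^2,
# so the average objective is minimized at b.mean(axis=0).
b = rng.normal(size=(M, dim))

def stoch_grad(x, m):
    """Stochastic gradient of machine m's objective (Gaussian noise added)."""
    return (x - b[m]) + 0.1 * rng.normal(size=dim)

def minibatch_sgd():
    x = np.zeros(dim)
    for _ in range(R):
        # One step per round on the average of M*K gradients, all evaluated at the shared x.
        g = np.mean([stoch_grad(x, m) for m in range(M) for _ in range(K)], axis=0)
        x -= lr * g
    return x

def local_sgd():
    x = np.zeros(dim)
    for _ in range(R):
        iterates = []
        for m in range(M):
            y = x.copy()
            for _ in range(K):              # K local steps on machine m's own objective
                y -= lr * stoch_grad(y, m)
            iterates.append(y)
        x = np.mean(iterates, axis=0)       # communication: average the local iterates
    return x

x_star = b.mean(axis=0)                     # minimizer of the average objective
print("Minibatch SGD error:", np.linalg.norm(minibatch_sgd() - x_star))
print("Local SGD error:    ", np.linalg.norm(local_sgd() - x_star))
```

In the heterogeneous case the local iterates drift toward each machine's own minimizer between communications; controlling that drift is what separates an analysis of Local SGD from the Minibatch SGD baseline.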
Related papers
- Stability and Generalization for Distributed SGDA [70.97400503482353]
We propose a stability-based generalization analysis framework for Distributed-SGDA.
We conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics.
Our theoretical results reveal the trade-off between the generalization gap and optimization error.
arXiv Detail & Related papers (2024-11-14T11:16:32Z)
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice.
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z)
- Stability and Generalization for Minibatch SGD and Local SGD [46.45496260281998]
Minibatch gradient descent (minibatch SGD) and local SGD are two popular methods for parallel optimization.
We study the stability and generalization analysis of minibatch and local SGD to understand their learnability.
We show both minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.
arXiv Detail & Related papers (2023-10-02T12:26:51Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
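As a point of reference for the D-SGD entry above, here is a minimal sketch of decentralized SGD with gossip averaging (a generic formulation, not code from the cited paper; the ring topology, the quadratic local objectives, and all constants are assumptions): each node takes a local stochastic gradient step and then averages its iterate with its neighbours through a doubly stochastic mixing matrix, so no central server coordinates the updates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim, steps, lr = 6, 4, 200, 0.05     # nodes, dimension, iterations, step size

# Doubly stochastic mixing matrix for a ring topology: each node averages with its two neighbours.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

targets = rng.normal(size=(n, dim))     # node i minimizes 0.5 * ||x - targets[i]||^2 (illustrative)
X = np.zeros((n, dim))                  # one row of parameters per node

for _ in range(steps):
    grads = (X - targets) + 0.1 * rng.normal(size=(n, dim))   # local stochastic gradients
    X = W @ (X - lr * grads)            # gossip step: mix the neighbours' updated iterates

print("consensus error:", np.linalg.norm(X - X.mean(axis=0)))
print("distance to average minimizer:", np.linalg.norm(X.mean(axis=0) - targets.mean(axis=0)))
```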
- SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization [12.709177728330399]
We design the first local update method that provably benefits over the two most prominent distributed baselines: Minibatch-SGD and Local-SGD.
Key to our approach is a slow querying technique that we customize to the distributed setting, which in turn enables a better mitigation of the bias caused by local updates.
arXiv Detail & Related papers (2023-04-09T06:10:49Z)
- SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization [18.668531108219415]
We study stochastic gradient descent-ascent (SGDA) with random reshuffling (SGDA-RR) for solving minimax optimization problems.
We analyze both simultaneous and alternating SGDA-RR for nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry.
Our rates also extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA).
Finally, we present a comprehensive lower bound for two-time-scale GDA, which matches the full-batch rate in the primal-PŁ-PŁ case.
arXiv Detail & Related papers (2022-10-12T08:05:41Z)
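The SGDA-RR entry above distinguishes simultaneous from alternating updates. The toy sketch below illustrates only that distinction (the convex-concave objective, the step sizes, and the noise model are assumptions, and no reshuffling is shown): simultaneous SGDA computes both gradients at the same point, while alternating SGDA updates x first and evaluates the y-gradient at the new x.

```python
import numpy as np

rng = np.random.default_rng(2)
lr_x, lr_y, steps = 0.05, 0.05, 500

# Toy saddle-point problem: min_x max_y f(x, y) = 0.5*x**2 + x*y - 0.5*y**2, saddle at (0, 0).
def grad_x(x, y):
    return x + y + 0.05 * rng.normal()       # noisy d/dx

def grad_y(x, y):
    return x - y + 0.05 * rng.normal()       # noisy d/dy

def simultaneous_sgda():
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)  # both gradients at the same (x, y)
        x, y = x - lr_x * gx, y + lr_y * gy  # descend in x, ascend in y
    return x, y

def alternating_sgda():
    x, y = 1.0, 1.0
    for _ in range(steps):
        x = x - lr_x * grad_x(x, y)          # update x first
        y = y + lr_y * grad_y(x, y)          # y-gradient evaluated at the updated x
    return x, y

print("simultaneous SGDA:", simultaneous_sgda())
print("alternating SGDA: ", alternating_sgda())
```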
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
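To illustrate the without-replacement sampling that the shuffling entry above refers to, here is a generic sketch of minibatch SGD with Random Reshuffling on a least-squares problem (the data, batch size, and step size are assumptions; local Random Reshuffling and synchronized shuffling are not shown): with-replacement SGD samples each minibatch independently, whereas Random Reshuffling draws a fresh permutation every epoch and sweeps it in disjoint minibatches, so each example is used exactly once per epoch.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dim, batch, epochs, lr = 64, 5, 8, 30, 0.1

A = rng.normal(size=(n, dim))
x_true = rng.normal(size=dim)
b = A @ x_true + 0.01 * rng.normal(size=n)          # least-squares data (illustrative)

def grad(x, idx):
    """Minibatch gradient of 0.5/len(idx) * ||A[idx] @ x - b[idx]||^2."""
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def sgd_with_replacement():
    x = np.zeros(dim)
    for _ in range(epochs * (n // batch)):
        idx = rng.integers(0, n, size=batch)        # i.i.d. minibatch (with replacement)
        x -= lr * grad(x, idx)
    return x

def sgd_random_reshuffling():
    x = np.zeros(dim)
    for _ in range(epochs):
        perm = rng.permutation(n)                   # fresh shuffle each epoch
        for s in range(0, n, batch):
            x -= lr * grad(x, perm[s:s + batch])    # disjoint minibatches, no replacement
    return x

print("with-replacement error:  ", np.linalg.norm(sgd_with_replacement() - x_true))
print("random reshuffling error:", np.linalg.norm(sgd_random_reshuffling() - x_true))
```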
- A Unified Theory of Decentralized SGD with Changing Topology and Local Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
- Is Local SGD Better than Minibatch SGD? [60.42437186984968]
We show how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD.
We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
arXiv Detail & Related papers (2020-02-18T19:22:43Z)