Is Local SGD Better than Minibatch SGD?
- URL: http://arxiv.org/abs/2002.07839v2
- Date: Mon, 20 Jul 2020 15:47:48 GMT
- Title: Is Local SGD Better than Minibatch SGD?
- Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai,
Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro
- Abstract summary: We show how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD.
We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
- Score: 60.42437186984968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study local SGD (also known as parallel SGD and federated averaging), a
natural and frequently used stochastic distributed optimization method. Its
theoretical foundations are currently lacking and we highlight how all existing
error guarantees in the convex setting are dominated by a simple baseline,
minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly
dominates minibatch SGD and that accelerated local SGD is minimax optimal for
quadratics; (2) For general convex objectives we provide the first guarantee
that at least sometimes improves over minibatch SGD; (3) We show that indeed
local SGD does not dominate minibatch SGD by presenting a lower bound on the
performance of local SGD that is worse than the minibatch SGD guarantee.
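To make the comparison concrete, below is a minimal NumPy sketch of the two protocols on a toy least-squares objective. It is only an illustration under assumed notation, not the paper's construction: the symbols K (machines), H (local steps per round), R (communication rounds), the quadratic objective, and the step size are all introduced here for the sketch.

```python
# A minimal NumPy sketch of the two protocols compared in the paper, on a toy
# least-squares objective. This is only an illustration: the objective, the
# symbols K (machines), H (local steps per round), R (rounds), and the step
# size are assumptions introduced here, not the paper's construction.
import numpy as np

d = 10                                   # problem dimension (assumed)
rng = np.random.default_rng(0)
w_star = rng.normal(size=d)              # minimizer of the population objective


def stochastic_grad(w, batch_size, rng):
    # Unbiased stochastic gradient of f(w) = 0.5 * E[(x^T (w - w_star))^2],
    # with x ~ N(0, I), estimated from `batch_size` fresh samples.
    X = rng.normal(size=(batch_size, d))
    return X.T @ (X @ (w - w_star)) / batch_size


def local_sgd(K, H, R, lr, rng):
    # Local SGD (parallel SGD / federated averaging): each of the K machines
    # takes H sequential single-sample SGD steps, then the iterates are
    # averaged; this costs one communication per round.
    w = np.zeros(d)
    for _ in range(R):
        local_iterates = []
        for _k in range(K):
            w_k = w.copy()
            for _h in range(H):
                w_k -= lr * stochastic_grad(w_k, 1, rng)
            local_iterates.append(w_k)
        w = np.mean(local_iterates, axis=0)
    return w


def minibatch_sgd(K, H, R, lr, rng):
    # Minibatch SGD baseline with the same per-round gradient budget and the
    # same number of communications: one step per round on a minibatch of
    # K * H stochastic gradients.
    w = np.zeros(d)
    for _ in range(R):
        w -= lr * stochastic_grad(w, K * H, rng)
    return w


if __name__ == "__main__":
    K, H, R, lr = 8, 16, 50, 0.05        # all values arbitrary, for illustration
    for name, algo in [("local SGD", local_sgd), ("minibatch SGD", minibatch_sgd)]:
        w = algo(K, H, R, lr, np.random.default_rng(1))
        print(f"{name:13s}: squared distance to optimum = "
              f"{np.linalg.norm(w - w_star) ** 2:.4f}")
```

In the paper's comparison each method is allowed its own tuned step size and the guarantees are stated in terms of the number of machines, local steps, and rounds; the single shared step size above is used only to keep the sketch short.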
Related papers
- Stability and Generalization for Distributed SGDA [70.97400503482353]
We propose a stability-based framework for analyzing the generalization of Distributed SGDA.
We conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics.
Our theoretical results reveal the trade-off between the generalization gap and optimization error.
arXiv Detail & Related papers (2024-11-14T11:16:32Z) - The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice.
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z) - Stability and Generalization for Minibatch SGD and Local SGD [46.45496260281998]
Minibatch gradient descent (minibatch SGD) and local SGD are two popular methods for parallel optimization.
We study the stability and generalization of minibatch and local SGD to understand their learnability.
We show both minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.
arXiv Detail & Related papers (2023-10-02T12:26:51Z) - Decentralized SGD and Average-direction SAM are Asymptotically
Equivalent [101.37242096601315]
Decentralized gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z) - Why (and When) does Local SGD Generalize Better than SGD? [46.993699881100454]
Local SGD is a communication-efficient variant of SGD for large-scale training.
This paper aims to understand why (and when) Local SGD generalizes better based on a Stochastic Differential Equation (SDE) approximation.
arXiv Detail & Related papers (2023-03-02T12:56:52Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that stochastic gradient descent (SGD) can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z) - Minibatch vs Local SGD for Heterogeneous Distributed Learning [28.80878557506603]
We argue that Minibatch SGD dominates all existing analyses of Local SGD in this setting.
We present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
arXiv Detail & Related papers (2020-06-08T16:40:49Z)
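To make the heterogeneous setting of the last two entries concrete, here is a minimal NumPy sketch in which each machine holds a quadratic with its own minimizer and the target is the average objective. The objectives, offsets, and hyperparameters are illustrative assumptions, not the construction analyzed in those papers.

```python
# A minimal NumPy sketch of the heterogeneous setting: each machine holds its
# own quadratic objective with a distinct minimizer, and the goal is the
# minimizer of the average objective. All objectives and hyperparameters are
# illustrative assumptions, not the construction used in the papers above.
import numpy as np

d, K, H, R, lr = 10, 8, 16, 50, 0.05     # dimension, machines, local steps, rounds, step size
rng = np.random.default_rng(0)
centers = rng.normal(size=(K, d))        # machine k's local minimizer
w_avg_opt = centers.mean(axis=0)         # minimizer of the average objective


def local_grad(k, w, rng):
    # Stochastic gradient of machine k's objective 0.5 * E[(x^T (w - centers[k]))^2].
    x = rng.normal(size=d)
    return x * (x @ (w - centers[k]))


# Local SGD: between communications each machine drifts toward its own local
# minimizer (often called "client drift"), which is what the heterogeneity
# assumptions in the analyses above need to control.
w_local = np.zeros(d)
for _ in range(R):
    iterates = []
    for k in range(K):
        w_k = w_local.copy()
        for _ in range(H):
            w_k -= lr * local_grad(k, w_k, rng)
        iterates.append(w_k)
    w_local = np.mean(iterates, axis=0)

# Minibatch SGD: one step per round, averaging H stochastic gradients from
# every machine, so each update follows the average (global) objective.
w_mb = np.zeros(d)
for _ in range(R):
    grads = [local_grad(k, w_mb, rng) for k in range(K) for _ in range(H)]
    w_mb -= lr * np.mean(grads, axis=0)

for name, w in [("local SGD", w_local), ("minibatch SGD", w_mb)]:
    print(f"{name:13s}: squared distance to the average-objective optimum = "
          f"{np.linalg.norm(w - w_avg_opt) ** 2:.4f}")
```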
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.