Trade-offs of Local SGD at Scale: An Empirical Study
- URL: http://arxiv.org/abs/2110.08133v1
- Date: Fri, 15 Oct 2021 15:00:42 GMT
- Title: Trade-offs of Local SGD at Scale: An Empirical Study
- Authors: Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos,
Nicolas Ballas
- Abstract summary: We study a technique known as local SGD to reduce communication overhead.
We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy.
We also show that incorporating the slow momentum framework consistently improves accuracy without requiring additional communication.
- Score: 24.961068070560344
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As datasets and models become increasingly large, distributed training has
become a necessary component to allow deep neural networks to train in
reasonable amounts of time. However, distributed training can have substantial
communication overhead that hinders its scalability. One strategy for reducing
this overhead is to perform multiple unsynchronized SGD steps independently on
each worker between synchronization steps, a technique known as local SGD. We
conduct a comprehensive empirical study of local SGD and related methods on a
large-scale image classification task. We find that performing local SGD comes
at a price: lower communication costs (and thereby faster training) are
accompanied by lower accuracy. This finding stands in contrast to the
smaller-scale experiments in prior work, suggesting that local SGD encounters
challenges at scale. We further show that incorporating the slow momentum
framework of Wang et al. (2020) consistently improves accuracy without
requiring additional communication, hinting at future directions for
potentially escaping this trade-off.
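To make the communication pattern concrete, here is a minimal single-process sketch of local SGD with an optional slow-momentum outer step, run on a synthetic quadratic objective. It is an illustration of the idea only: the worker count, communication period H, learning rates, and momentum coefficient are arbitrary choices rather than the paper's ImageNet settings, and the outer update follows our reading of the SlowMo framework up to learning-rate scaling.
```python
# Minimal simulation of local SGD with an optional SlowMo-style outer update.
# Illustrative sketch only: synthetic quadratic objectives, arbitrary hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
N_WORKERS, DIM = 8, 20
H = 16             # local SGD steps between synchronizations
OUTER_ROUNDS = 50  # number of synchronization (communication) rounds
LR = 0.05          # inner, per-worker learning rate
SLOWMO_BETA = 0.8  # outer (slow) momentum coefficient
SLOWMO_LR = 1.0    # outer (slow) learning rate

# Worker i minimizes f_i(x) = 0.5 * ||A_i x - b_i||^2; the A_i, b_i differ per worker.
A = rng.normal(size=(N_WORKERS, DIM, DIM)) / np.sqrt(DIM)
b = rng.normal(size=(N_WORKERS, DIM))

def stochastic_grad(i, x, noise=0.1):
    """Gradient of worker i's objective plus noise that mimics minibatch sampling."""
    return A[i].T @ (A[i] @ x - b[i]) + noise * rng.normal(size=DIM)

def train(use_slowmo):
    x_global = np.zeros(DIM)   # the synchronized model
    slow_buf = np.zeros(DIM)   # SlowMo momentum buffer
    for _ in range(OUTER_ROUNDS):
        # Each worker takes H unsynchronized SGD steps starting from the shared point.
        worker_models = []
        for i in range(N_WORKERS):
            x = x_global.copy()
            for _ in range(H):
                x -= LR * stochastic_grad(i, x)
            worker_models.append(x)
        # The only communication: one exact average every H local steps.
        x_avg = np.mean(worker_models, axis=0)
        if use_slowmo:
            # SlowMo-style outer step: momentum on the pseudo-gradient (x_global - x_avg).
            # Wang et al. (2020) fold the inner learning rate into this update; omitted here.
            slow_buf = SLOWMO_BETA * slow_buf + (x_global - x_avg)
            x_global = x_global - SLOWMO_LR * slow_buf
        else:
            x_global = x_avg   # plain local SGD: adopt the average
    return np.mean([0.5 * np.sum((A[i] @ x_global - b[i]) ** 2) for i in range(N_WORKERS)])

print("local SGD           final loss:", train(use_slowmo=False))
print("local SGD + SlowMo  final loss:", train(use_slowmo=True))
```
Each outer round costs one synchronization for every H local steps, which is where the H-fold communication savings come from, and also where the accuracy gap relative to fully synchronous SGD can open up at scale.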
Related papers
- Why (and When) does Local SGD Generalize Better than SGD? [46.993699881100454]
Local SGD is a communication-efficient variant of SGD for large-scale training.
This paper aims to understand why (and when) Local SGD generalizes better, based on a Stochastic Differential Equation (SDE) approximation.
arXiv Detail & Related papers (2023-03-02T12:56:52Z)
- Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
arXiv Detail & Related papers (2023-02-19T17:42:35Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between the self-supervised learning (SSL) and dynamic computation (DC) paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- Local Stochastic Gradient Descent Ascent: Convergence Analysis and Communication Efficiency [15.04034188283642]
Local SGD is a promising approach to overcome the communication overhead in distributed learning.
We show that local SGDA can provably optimize distributed minimax problems in both homogeneous and heterogeneous data settings.
arXiv Detail & Related papers (2021-02-25T20:15:18Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while gradients are synchronized.
DaSGD parallelizes the SGD updates with forward/back propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than standard error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF; DEFA enjoys better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- A Unified Theory of Decentralized SGD with Changing Topology and Local Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
- Intermittent Pulling with Local Compensation for Communication-Efficient Federated Learning [20.964434898554344]
Federated Learning is a powerful machine learning paradigm to train a global model with highly distributed data.
A major bottleneck in the performance of distributed SGD is the communication overhead of pushing local updates and pulling the global model.
We propose a novel approach named Pulling Reduction with Local Compensation (PRLC) to reduce this communication overhead.
arXiv Detail & Related papers (2020-01-22T20:53:14Z)
- Variance Reduced Local SGD with Lower Communication Complexity [52.44473777232414]
We propose Variance Reduced Local SGD to further reduce the communication complexity.
VRL-SGD achieves a linear iteration speedup with a lower communication complexity of $O(T^{1/2} N^{3/2})$, even if workers access non-identical datasets.
arXiv Detail & Related papers (2019-12-30T08:15:21Z)
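As a rough sanity check on the communication complexity quoted above (our own arithmetic, not a claim from the VRL-SGD paper): fully synchronous SGD communicates once per iteration, i.e. $O(T)$ rounds over $T$ iterations, so $O(T^{1/2} N^{3/2})$ is an asymptotic reduction whenever the worker count $N$ grows slower than $T^{1/3}$:
$$\frac{T^{1/2} N^{3/2}}{T} \;=\; \sqrt{\frac{N^{3}}{T}} \;\longrightarrow\; 0 \qquad \text{when } N^{3} = o(T).$$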