Trade-offs of Local SGD at Scale: An Empirical Study
- URL: http://arxiv.org/abs/2110.08133v1
- Date: Fri, 15 Oct 2021 15:00:42 GMT
- Title: Trade-offs of Local SGD at Scale: An Empirical Study
- Authors: Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos,
Nicolas Ballas
- Abstract summary: We study a technique known as local SGD to reduce communication overhead.
We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy.
We also show that incorporating the slow momentum framework consistently improves accuracy without requiring additional communication.
- Score: 24.961068070560344
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As datasets and models become increasingly large, distributed training has
become a necessary component to allow deep neural networks to train in
reasonable amounts of time. However, distributed training can have substantial
communication overhead that hinders its scalability. One strategy for reducing
this overhead is to perform multiple unsynchronized SGD steps independently on
each worker between synchronization steps, a technique known as local SGD. We
conduct a comprehensive empirical study of local SGD and related methods on a
large-scale image classification task. We find that performing local SGD comes
at a price: lower communication costs (and thereby faster training) are
accompanied by lower accuracy. This finding stands in contrast to the
smaller-scale experiments in prior work, suggesting that local SGD encounters
challenges at scale. We further show that incorporating the slow momentum
framework of Wang et al. (2020) consistently improves accuracy without
requiring additional communication, hinting at future directions for
potentially escaping this trade-off.
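To make the communication pattern concrete, here is a minimal single-process sketch of local SGD with an optional slow-momentum outer step, run on a synthetic quadratic objective. It is an illustration of the idea only: the worker count, communication period H, learning rates, and momentum coefficient are arbitrary choices rather than the paper's ImageNet settings, and the outer update follows our reading of the SlowMo framework up to learning-rate scaling.
```python
# Minimal simulation of local SGD with an optional SlowMo-style outer update.
# Illustrative sketch only: synthetic quadratic objectives, arbitrary hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
N_WORKERS, DIM = 8, 20
H = 16             # local SGD steps between synchronizations
OUTER_ROUNDS = 50  # number of synchronization (communication) rounds
LR = 0.05          # inner, per-worker learning rate
SLOWMO_BETA = 0.8  # outer (slow) momentum coefficient
SLOWMO_LR = 1.0    # outer (slow) learning rate

# Worker i minimizes f_i(x) = 0.5 * ||A_i x - b_i||^2; the A_i, b_i differ per worker.
A = rng.normal(size=(N_WORKERS, DIM, DIM)) / np.sqrt(DIM)
b = rng.normal(size=(N_WORKERS, DIM))

def stochastic_grad(i, x, noise=0.1):
    """Gradient of worker i's objective plus noise that mimics minibatch sampling."""
    return A[i].T @ (A[i] @ x - b[i]) + noise * rng.normal(size=DIM)

def train(use_slowmo):
    x_global = np.zeros(DIM)   # the synchronized model
    slow_buf = np.zeros(DIM)   # SlowMo momentum buffer
    for _ in range(OUTER_ROUNDS):
        # Each worker takes H unsynchronized SGD steps starting from the shared point.
        worker_models = []
        for i in range(N_WORKERS):
            x = x_global.copy()
            for _ in range(H):
                x -= LR * stochastic_grad(i, x)
            worker_models.append(x)
        # The only communication: one exact average every H local steps.
        x_avg = np.mean(worker_models, axis=0)
        if use_slowmo:
            # SlowMo-style outer step: momentum on the pseudo-gradient (x_global - x_avg).
            # Wang et al. (2020) fold the inner learning rate into this update; omitted here.
            slow_buf = SLOWMO_BETA * slow_buf + (x_global - x_avg)
            x_global = x_global - SLOWMO_LR * slow_buf
        else:
            x_global = x_avg   # plain local SGD: adopt the average
    return np.mean([0.5 * np.sum((A[i] @ x_global - b[i]) ** 2) for i in range(N_WORKERS)])

print("local SGD           final loss:", train(use_slowmo=False))
print("local SGD + SlowMo  final loss:", train(use_slowmo=True))
```
Each outer round costs one synchronization for every H local steps, which is where the H-fold communication savings come from, and also where the accuracy gap relative to fully synchronous SGD can open up at scale.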
Related papers
- Why (and When) does Local SGD Generalize Better than SGD? [46.993699881100454]
Local SGD is a communication-efficient variant of SGD for large-scale training.
This paper aims to understand why (and when) Local SGD generalizes better, based on a Stochastic Differential Equation (SDE) approximation.
arXiv Detail & Related papers (2023-03-02T12:56:52Z)
- Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
arXiv Detail & Related papers (2023-02-19T17:42:35Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between the self-supervised learning (SSL) and dynamic computation (DC) paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- Local Stochastic Gradient Descent Ascent: Convergence Analysis and Communication Efficiency [15.04034188283642]
Local SGD is a promising approach to overcome the communication overhead in distributed learning.
We show that local SGDA can provably optimize distributed minimax problems in both homogeneous and heterogeneous data settings.
arXiv Detail & Related papers (2021-02-25T20:15:18Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while gradients are synchronized.
DaSGD parallelizes the SGD updates with forward/back propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than standard error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF; DEFA enjoys better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- A Unified Theory of Decentralized SGD with Changing Topology and Local Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
- Intermittent Pulling with Local Compensation for Communication-Efficient Federated Learning [20.964434898554344]
Federated Learning is a powerful machine learning paradigm to train a global model with highly distributed data.
A major bottleneck in the performance of distributed SGD is the communication overhead of pushing local updates and pulling the global model.
We propose a novel approach named Pulling Reduction with Local Compensation (PRLC) to reduce this communication overhead.
arXiv Detail & Related papers (2020-01-22T20:53:14Z)
- Variance Reduced Local SGD with Lower Communication Complexity [52.44473777232414]
We propose Variance Reduced Local SGD to further reduce the communication complexity.
VRL-SGD achieves a linear iteration speedup with a lower communication complexity of $O(T^{1/2} N^{3/2})$, even if workers access non-identical datasets.
arXiv Detail & Related papers (2019-12-30T08:15:21Z)
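As a rough sanity check on the communication complexity quoted above (our own arithmetic, not a claim from the VRL-SGD paper): fully synchronous SGD communicates once per iteration, i.e. $O(T)$ rounds over $T$ iterations, so $O(T^{1/2} N^{3/2})$ is an asymptotic reduction whenever the worker count $N$ grows slower than $T^{1/3}$:
$$\frac{T^{1/2} N^{3/2}}{T} \;=\; \sqrt{\frac{N^{3}}{T}} \;\longrightarrow\; 0 \qquad \text{when } N^{3} = o(T).$$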