Detached Error Feedback for Distributed SGD with Random Sparsification
- URL: http://arxiv.org/abs/2004.05298v3
- Date: Mon, 13 Jun 2022 13:19:06 GMT
- Title: Detached Error Feedback for Distributed SGD with Random Sparsification
- Authors: An Xu, Heng Huang
- Abstract summary: The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows a better convergence bound than error feedback for non-convex problems.
We also propose DEF-A to accelerate the generalization of DEF at the early stages of training, which shows better generalization bounds than DEF.
- Score: 98.98236187442258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The communication bottleneck has been a critical problem in large-scale
distributed deep learning. In this work, we study distributed SGD with random
block-wise sparsification as the gradient compressor, which is ring-allreduce
compatible and highly computation-efficient but leads to inferior performance.
To tackle this important issue, we improve the communication-efficient
distributed SGD from a novel aspect, that is, the trade-off between the
variance and second moment of the gradient. With this motivation, we propose a
new detached error feedback (DEF) algorithm, which shows better convergence
bound than error feedback for non-convex problems. We also propose DEF-A to
accelerate the generalization of DEF at the early stages of the training, which
shows better generalization bounds than DEF. Furthermore, we establish the
connection between communication-efficient distributed SGD and SGD with iterate
averaging (SGD-IA) for the first time. Extensive deep learning experiments show
significant empirical improvement of the proposed methods under various
settings.
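For orientation, below is a minimal Python/NumPy sketch of the setup the abstract describes: random block-wise sparsification used as a ring-allreduce-compatible gradient compressor, wrapped in a plain error-feedback loop. The detached error feedback (DEF) and DEF-A updates differ in how the feedback term is formed, and those details are not reproduced here; the function names and the single-worker step below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def random_block_sparsify(grad, num_blocks, rng):
    """Keep one randomly chosen contiguous block of the flattened gradient
    and zero everything else. If every worker draws the block index from a
    shared seed, all workers keep the same coordinates, which is what makes
    the compressor ring-allreduce compatible."""
    flat = grad.ravel()
    blocks = np.array_split(np.arange(flat.size), num_blocks)
    chosen = blocks[rng.integers(num_blocks)]     # shared random block index
    out = np.zeros_like(flat)
    out[chosen] = flat[chosen]
    return out.reshape(grad.shape)

def ef_step(param, grad, memory, lr, num_blocks, rng):
    """One worker-side step of plain error-feedback SGD with the compressor
    above (an illustrative baseline, not the paper's DEF/DEF-A update)."""
    corrected = grad + memory                     # add back previously dropped mass
    compressed = random_block_sparsify(corrected, num_blocks, rng)
    memory = corrected - compressed               # compression error kept locally
    # In a multi-worker run, `compressed` would be averaged via ring-allreduce here.
    return param - lr * compressed, memory

# Toy usage: one parameter vector, one step.
rng = np.random.default_rng(0)                    # same seed on every worker
w, mem = np.ones(8), np.zeros(8)
w, mem = ef_step(w, np.full(8, 0.5), mem, lr=0.1, num_blocks=4, rng=rng)
```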
Related papers
- Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification
in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
arXiv Detail & Related papers (2023-02-19T17:42:35Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes an adaptive Top-K in SGD framework that adjusts the degree of sparsification at each gradient descent step to optimize convergence performance (a minimal Top-K sketch appears after this list).
Numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate than state-of-the-art methods.
arXiv Detail & Related papers (2022-10-24T18:33:35Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst-distribution test accuracy by up to 10%.
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Compressing gradients by exploiting temporal correlation in momentum-SGD [17.995905582226463]
We analyze compression methods that exploit temporal correlation in systems with and without error-feedback.
Experiments with the ImageNet dataset demonstrate that our proposed methods offer a significant reduction in the rate of communication.
We prove the convergence of SGD under an expected error assumption by establishing a bound for the minimum gradient norm.
arXiv Detail & Related papers (2021-08-17T18:04:06Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under suitable assumptions on the delays.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - FedADC: Accelerated Federated Learning with Drift Control [6.746400031322727]
Federated learning (FL) has become the de facto framework for collaborative learning among edge devices with privacy concerns.
Large scale implementation of FL brings new challenges, such as the incorporation of acceleration techniques designed for SGD into the distributed setting, and mitigation of the drift problem due to non-homogeneous distribution of local datasets.
We show that it is possible to address both problems using a single strategy without any major alteration to the FL framework, or introducing additional computation and communication load.
We propose FedADC, which is an accelerated FL algorithm with drift control.
arXiv Detail & Related papers (2020-12-16T17:49:37Z) - Linearly Converging Error Compensated SGD [11.436753102510647]
We propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates.
Our framework is general enough to cover different variants of quantized SGD, Error Compensated SGD, and SGD with delayed updates.
We develop new variants of SGD that combine variance reduction or arbitrary sampling with error feedback and quantization.
arXiv Detail & Related papers (2020-10-23T10:46:31Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
In theory, our method requires a much smaller number of communication rounds.
Our experiments on several datasets demonstrate the effectiveness of our method and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Stochastic-Sign SGD for Federated Learning with Theoretical Guarantees [49.91477656517431]
Quantization-based solvers have been widely adopted in Federated Learning (FL).
However, no existing method enjoys all of the desired properties.
We propose an intuitively simple yet theoretically sound method based on SIGNSGD to bridge the gap.
arXiv Detail & Related papers (2020-02-25T15:12:15Z)
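As a companion to the Adaptive Top-K entry above, here is a minimal, illustrative Top-K gradient sparsifier in NumPy. How k is adapted per step is specific to that paper and is not reproduced here; `top_k_sparsify` and its arguments are assumptions made only for illustration.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude coordinates of the gradient, zero the rest."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |g_i|
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

# Toy usage: keep 2 of 6 coordinates.
g = np.array([0.1, -3.0, 0.5, 2.0, -0.2, 0.05])
print(top_k_sparsify(g, k=2))                     # only -3.0 and 2.0 survive
```

In an error-feedback pipeline, the coordinates dropped here (grad minus its sparsified version) would be accumulated in a local memory and added back at the next step, as in the error-compensated methods listed above.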