SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized
Optimization
- URL: http://arxiv.org/abs/2005.07041v3
- Date: Mon, 11 Oct 2021 05:14:46 GMT
- Title: SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized
Optimization
- Authors: Navjot Singh, Deepesh Data, Jemin George, Suhas Diggavi
- Abstract summary: We propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of machine learning models over a network.
We show that the convergence rate of SQuARM-SGD matches that of vanilla SGD.
We empirically show that including momentum updates in SQuARM-SGD can lead to better test performance than the current state-of-the-art which does not consider momentum updates.
- Score: 22.190763887903085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose and analyze SQuARM-SGD, a communication-efficient
algorithm for decentralized training of large-scale machine learning models
over a network. In SQuARM-SGD, each node performs a fixed number of local SGD
steps using Nesterov's momentum and then sends sparsified and quantized updates
to its neighbors regulated by a locally computable triggering criterion. We
provide convergence guarantees of our algorithm for general (non-convex) and
convex smooth objectives, which, to the best of our knowledge, is the first
theoretical analysis for compressed decentralized SGD with momentum updates. We
show that the convergence rate of SQuARM-SGD matches that of vanilla SGD. We
empirically show that including momentum updates in SQuARM-SGD can lead to
better test performance than the current state-of-the-art which does not
consider momentum updates.
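The per-node round described in the abstract (local Nesterov-momentum steps, then a triggered, compressed update) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the sign-plus-top-k compressor, the fixed `threshold`, and the use of a local estimate `x_hat` are all assumptions made for the example, and the neighbor averaging (consensus) step is omitted.

```python
import math

def top_k_sign(vec, k):
    """Hypothetical compressor: keep the k largest-magnitude entries and
    replace each by its sign times the mean magnitude of the kept entries."""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    scale = sum(abs(vec[i]) for i in idx) / k
    out = [0.0] * len(vec)
    for i in idx:
        out[i] = math.copysign(scale, vec[i])
    return out

def nesterov_step(x, v, grad_fn, lr, beta):
    """One Nesterov-momentum step: the gradient is taken at the look-ahead point."""
    look_ahead = [xi + beta * vi for xi, vi in zip(x, v)]
    g = grad_fn(look_ahead)
    v_new = [beta * vi - lr * gi for vi, gi in zip(v, g)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new

def node_round(x, v, x_hat, grad_fn, lr=0.1, beta=0.9,
               local_steps=4, k=1, threshold=1e-3):
    """Run a fixed number of local steps, then decide whether to communicate.

    x_hat is the estimate of this node's parameters that neighbors hold.
    The trigger (an assumed form) fires when x has drifted far enough from x_hat.
    """
    for _ in range(local_steps):
        x, v = nesterov_step(x, v, grad_fn, lr, beta)
    diff = [xi - hi for xi, hi in zip(x, x_hat)]
    if sum(d * d for d in diff) > threshold:
        msg = top_k_sign(diff, k)          # sparsified and quantized update
        x_hat = [hi + mi for hi, mi in zip(x_hat, msg)]
    else:
        msg = None                         # nothing is sent this round
    return x, v, x_hat, msg

# Usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x, v, x_hat, msg = node_round([1.0, -2.0], [0.0, 0.0], [0.0, 0.0],
                              grad_fn=lambda p: list(p))
```

Only the compressed difference `msg` would travel over the network, and only in rounds where the trigger fires, which is where the communication savings in such a scheme would come from.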
Related papers
- Stability and Generalization for Distributed SGDA [70.97400503482353]
We propose the stability-based generalization analytical framework for Distributed-SGDA.
We conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics.
Our theoretical results reveal the trade-off between the generalization gap and optimization error.
arXiv Detail & Related papers (2024-11-14T11:16:32Z)
- Ordered Momentum for Asynchronous SGD [12.810976838406193]
We propose a novel method called ordered momentum (OrMo) for ASGD.
In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their indexes.
Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD.
arXiv Detail & Related papers (2024-07-27T11:35:19Z)
- Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks [40.95782849532316]
Confidence indicators (CIs) are crucial for safe deployment of graph neural networks (GNNs) under distribution shift.
We show that increased expressivity or model size does not always lead to improved CI performance.
We propose G-$\Delta$UQ, a new single model UQ method that extends the recently proposed framework.
Overall, our work not only introduces a new, flexible GNN UQ method, but also provides novel insights into GNN CIs on safety-critical tasks.
arXiv Detail & Related papers (2023-09-20T00:35:27Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
arXiv Detail & Related papers (2022-08-29T18:01:42Z)
- DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training [30.574484395380043]
Decentralized momentum SGD (DmSGD) is more communication efficient than parallel momentum SGD, which incurs a global average across all computing nodes.
We propose DecentLaM, a decentralized momentum SGD method for large-batch training.
arXiv Detail & Related papers (2021-04-24T16:21:01Z)
- OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training [5.888925582071453]
We propose a novel technique named One-step Delay SGD (OD-SGD) that combines the strengths of synchronous and asynchronous SGD in the training process.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2020-05-14T05:33:36Z)
- A Unified Theory of Decentralized SGD with Changing Topology and Local Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
- Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD [32.03967072200476]
We propose an algorithmic approach named Overlap Local-SGD (and its momentum variant).
We achieve this by adding an anchor model on each node.
After multiple local updates, locally trained models will be pulled back towards the anchor model rather than communicating with others.
arXiv Detail & Related papers (2020-02-21T20:33:49Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.