DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training
- URL: http://arxiv.org/abs/2104.11981v1
- Date: Sat, 24 Apr 2021 16:21:01 GMT
- Title: DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training
- Authors: Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui
Xu, Wotao Yin
- Abstract summary: Decentralized momentum SGD (DmSGD) is more communication efficient than Parallel momentum SGD that incurs global average across all computing nodes.
We propose DecentLaM, a novel decentralized large-batch momentum SGD that removes the momentum-incurred bias.
- Score: 30.574484395380043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scale of deep learning nowadays calls for efficient distributed training
algorithms. Decentralized momentum SGD (DmSGD), in which each node averages
only with its neighbors, is more communication efficient than vanilla Parallel
momentum SGD, which incurs a global average across all computing nodes. On the
other hand, large-batch training has been demonstrated to be critical for
runtime speedup. This motivates us to investigate how DmSGD performs in the
large-batch scenario.
In this work, we find the momentum term can amplify the inconsistency bias in
DmSGD. Such bias becomes more evident as batch-size grows large and hence
results in severe performance degradation. We next propose DecentLaM, a novel
decentralized large-batch momentum SGD to remove the momentum-incurred bias.
The convergence rate for both non-convex and strongly-convex scenarios is
established. Our theoretical results justify the superiority of DecentLaM to
DmSGD especially in the large-batch scenario. Experimental results on a variety
of computer vision tasks and models demonstrate that DecentLaM promises both
efficient and high-quality training.
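To make the recursion concrete, below is a minimal NumPy sketch of a generic decentralized momentum SGD (DmSGD-style) step in which each node averages only with its ring neighbors. The ring mixing matrix, the quadratic toy losses, and all hyperparameters are assumptions chosen for illustration; the paper's exact DmSGD and DecentLaM recursions are given in the paper itself.

```python
import numpy as np

# Toy setup: n nodes, each with its own quadratic loss f_i(x) = 0.5*||x - b_i||^2.
# This is an assumed example, not the paper's experimental setup.
n, dim = 8, 4
rng = np.random.default_rng(0)
b = rng.normal(size=(n, dim))          # per-node optima (models data heterogeneity)

def local_grad(i, x):
    """Gradient of node i's local quadratic loss at x."""
    return x - b[i]

# Symmetric, doubly stochastic mixing matrix for a ring: average with two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

lr, beta, steps = 0.1, 0.9, 200
x = np.zeros((n, dim))                 # one parameter copy per node
m = np.zeros((n, dim))                 # one momentum buffer per node

for _ in range(steps):
    g = np.stack([local_grad(i, x[i]) for i in range(n)])
    m = beta * m + g                   # heavy-ball momentum, kept local to each node
    # Each node averages only with its neighbors (gossip) while taking its momentum step.
    x = W @ (x - lr * m)

print("consensus gap:", np.abs(x - x.mean(axis=0)).max())
print("max per-node distance to the global optimum:", np.abs(x - b.mean(axis=0)).max())
```

With heterogeneous local losses, the per-node iterates settle at points that differ from both each other and the global minimizer; this residual disagreement loosely illustrates the kind of inconsistency bias discussed above, which the paper shows is amplified by momentum in DmSGD and removed in DecentLaM.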
Related papers
- When and Why Momentum Accelerates SGD: An Empirical Study [76.2666927020119]
This study examines the performance of stochastic gradient descent (SGD) with momentum (SGDM).
We find that momentum acceleration is closely related to abrupt sharpening, which describes a sudden jump of the directional Hessian along the update direction.
Momentum improves performance by preventing or deferring the occurrence of abrupt sharpening.
arXiv Detail & Related papers (2023-06-15T09:54:21Z)
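The directional Hessian quantity behind "abrupt sharpening" can be probed with a double-backward pass. Below is a hedged PyTorch sketch that computes the curvature d^T H d along a chosen update direction; the tiny model, random data, and the choice of the normalized negative gradient as the direction are assumptions for illustration, not the study's protocol.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
params = list(model.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep graph for double backward

# Update direction: here the normalized negative gradient; a momentum buffer could be used instead.
d = [-g.detach() for g in grads]
norm = torch.sqrt(sum((di ** 2).sum() for di in d))
d = [di / norm for di in d]

# Hessian-vector product H d via a second backward through the gradient.
gd = sum((g * di).sum() for g, di in zip(grads, d))
Hd = torch.autograd.grad(gd, params)

# Directional curvature d^T H d: a sudden jump of this value during training
# is the "abrupt sharpening" phenomenon described above.
curvature = sum((hdi * di).sum() for hdi, di in zip(Hd, d))
print("directional curvature:", curvature.item())
```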
- Solving Large-scale Spatial Problems with Convolutional Neural Networks [88.31876586547848]
We employ transfer learning to improve training efficiency for large-scale spatial problems.
We propose that a convolutional neural network (CNN) can be trained on small windows of signals, but evaluated on arbitrarily large signals with little to no performance degradation.
arXiv Detail & Related papers (2023-06-14T01:24:42Z)
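The "train on small windows, evaluate on arbitrarily large signals" behaviour is natural for fully convolutional networks, whose layers place no constraint on input length. The sketch below illustrates that mechanism on a made-up 1-D denoising task; the architecture and data are assumptions, not those of the paper.

```python
import torch

torch.manual_seed(0)

# A fully convolutional 1-D network: no flatten/linear layer, so input length is unconstrained.
net = torch.nn.Sequential(
    torch.nn.Conv1d(1, 16, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.Conv1d(16, 1, kernel_size=5, padding=2),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Train on short windows (length 64) of a synthetic denoising task.
for _ in range(200):
    clean = torch.sin(torch.linspace(0, 6.28, 64)).repeat(8, 1, 1)
    noisy = clean + 0.1 * torch.randn_like(clean)
    loss = torch.nn.functional.mse_loss(net(noisy), clean)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate on a much longer signal (length 4096) with the same weights.
with torch.no_grad():
    long_clean = torch.sin(torch.linspace(0, 402.0, 4096)).view(1, 1, -1)
    long_noisy = long_clean + 0.1 * torch.randn_like(long_clean)
    print("long-signal MSE:", torch.nn.functional.mse_loss(net(long_noisy), long_clean).item())
```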
- Fast Diffusion Model [122.36693015093041]
Diffusion models (DMs) have been adopted across diverse fields for their ability to capture intricate data distributions.
In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a DM optimization perspective.
arXiv Detail & Related papers (2023-06-12T09:38:04Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
- Scalable Optimal Margin Distribution Machine [50.281535710689795]
Optimal Margin Distribution Machine (ODM) is a newly proposed statistical learning framework rooted in the novel margin theory.
This paper proposes a scalable ODM, which can achieve nearly ten times speedup compared to the original ODM training method.
arXiv Detail & Related papers (2023-05-08T16:34:04Z)
- Contrastive Weight Regularization for Large Minibatch SGD [8.927483136015283]
We introduce a novel regularization technique, namely distinctive regularization (DReg).
DReg replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse.
We empirically show that optimizing the neural network with DReg using large-batch SGD achieves a significant boost in convergence and improved performance.
arXiv Detail & Related papers (2020-11-17T22:07:38Z)
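As summarized above, DReg replicates a layer and pushes the two copies' parameters apart. The sketch below illustrates that general idea with a generic diversity penalty; how the copies are combined in the forward pass and the cosine-based penalty are assumptions of this sketch, not the paper's exact formulation.

```python
import torch

torch.manual_seed(0)

class ReplicatedLinear(torch.nn.Module):
    """A layer kept in two copies; the forward pass averages their outputs.
    How the copies are combined is an assumption of this sketch."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.a = torch.nn.Linear(d_in, d_out)
        self.b = torch.nn.Linear(d_in, d_out)

    def forward(self, x):
        return 0.5 * (self.a(x) + self.b(x))

    def diversity_penalty(self):
        # Penalize similarity (squared cosine) between the two weight copies,
        # which encourages them to stay diverse.
        wa, wb = self.a.weight.flatten(), self.b.weight.flatten()
        return torch.nn.functional.cosine_similarity(wa, wb, dim=0) ** 2

net = torch.nn.Sequential(ReplicatedLinear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
lam = 0.1                                                     # regularization strength (assumed)

x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))      # toy classification batch
for _ in range(50):
    logits = net(x)
    loss = torch.nn.functional.cross_entropy(logits, y) + lam * net[0].diversity_penalty()
    opt.zero_grad()
    loss.backward()
    opt.step()
```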
- Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
arXiv Detail & Related papers (2020-07-28T04:34:43Z)
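One generic way to combine gradient normalization with heavy-ball momentum, in the spirit of the SNGM summary above, is sketched below. Where the normalization is applied (here, to the raw gradient before it enters the momentum buffer) is an assumption; SNGM's precise update rule is specified in the paper.

```python
import numpy as np

def normalized_momentum_step(w, m, grad, lr=0.05, beta=0.9, eps=1e-12):
    """One step of a generic normalized-gradient method with momentum.
    The gradient is scaled to unit norm before entering the momentum buffer
    (an assumption of this sketch; SNGM's exact update is in the paper)."""
    g_hat = grad / (np.linalg.norm(grad) + eps)
    m = beta * m + g_hat
    w = w - lr * m
    return w, m

# Toy usage on a quadratic loss f(w) = 0.5 * ||w - w_star||^2 with exact gradients.
rng = np.random.default_rng(0)
w_star = rng.normal(size=10)
w, m = np.zeros(10), np.zeros(10)
for _ in range(500):
    w, m = normalized_momentum_step(w, m, grad=w - w_star)
print("final error:", np.linalg.norm(w - w_star))
```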
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations to wait for gradient synchronization.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
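The delayed-averaging idea can be mimicked in a single process by letting each worker fold in a one-step-stale average of the replicas, which is what allows the real implementation to overlap communication with the next forward/backward pass. The sketch below simulates only the staleness; the quadratic toy losses and the one-step delay are assumptions, not DaSGD's exact schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 4, 8, 0.1, 100
b = rng.normal(size=(n_workers, dim))          # per-worker quadratic optima (toy heterogeneity)

x = np.zeros((n_workers, dim))                 # local model replicas
pending_avg = x.mean(axis=0)                   # "in-flight" average from the previous step

for _ in range(steps):
    grads = x - b                              # local gradients of 0.5*||x_i - b_i||^2
    # Apply the average that "arrived" from the previous step (one step stale),
    # then take the local SGD step; in a real run the fresh all-reduce would be
    # launched asynchronously here and overlap the next forward/backward pass.
    x = pending_avg[None, :] - lr * grads
    pending_avg = x.mean(axis=0)               # result of the (delayed) all-reduce

print("consensus gap:", np.abs(x - x.mean(axis=0)).max())
```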
- SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization [22.190763887903085]
We propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of machine learning models over a network.
We show that the convergence rate of SQuARM-SGD matches that of vanilla SGD with momentum updates.
We empirically show that including momentum updates in SQuARM-SGD can lead to better test performance than the current state-of-the-art which does not consider momentum updates.
arXiv Detail & Related papers (2020-05-13T02:11:14Z)
- Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
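Averaging weights within small, rotating subgroups rather than across all workers is the communication saving described above. Below is a single-process simulation of that pattern; the group size, rotation rule, and toy losses are assumptions rather than WAGMA-SGD's actual group schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, group_size = 8, 4, 2
lr, steps = 0.1, 200
b = rng.normal(size=(n_workers, dim))          # heterogeneous per-worker optima (toy)

x = np.zeros((n_workers, dim))
workers = np.arange(n_workers)

for t in range(steps):
    x -= lr * (x - b)                          # local SGD step on each worker's quadratic loss
    # Partition workers into groups of `group_size`, rotating membership each step,
    # and average only within each group (cheaper than a global all-reduce).
    order = np.roll(workers, t % n_workers)
    for g in range(0, n_workers, group_size):
        group = order[g:g + group_size]
        x[group] = x[group].mean(axis=0)

print("consensus gap after rotating group averaging:", np.abs(x - x.mean(axis=0)).max())
```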
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
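The "early momentum" (SGDEM) idea can be sketched as heavy-ball SGD whose momentum coefficient is switched off after an initial phase. The cutoff epoch and the noisy quadratic problem below are assumptions for illustration; the paper analyzes the schedule and step-size conditions formally.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = rng.normal(size=20)
w, m = np.zeros(20), np.zeros(20)
epochs, steps_per_epoch, lr = 30, 50, 0.02
early_phase = 10                               # momentum used only for the first 10 epochs (assumed cutoff)

for epoch in range(epochs):
    beta = 0.9 if epoch < early_phase else 0.0
    for _ in range(steps_per_epoch):
        grad = (w - w_star) + 0.1 * rng.normal(size=20)   # noisy gradient of a quadratic loss
        m = beta * m + grad                               # heavy-ball buffer (plain SGD once beta = 0)
        w = w - lr * m

print("final error:", np.linalg.norm(w - w_star))
```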