DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training
- URL: http://arxiv.org/abs/2104.11981v1
- Date: Sat, 24 Apr 2021 16:21:01 GMT
- Title: DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training
- Authors: Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui
Xu, Wotao Yin
- Abstract summary: Decentralized momentum SGD (DmSGD) is more communication efficient than Parallel momentum SGD that incurs global average across all computing nodes.
We propose DecentLaM, a novel decentralized large-batch momentum SGD that removes the momentum-incurred bias.
- Score: 30.574484395380043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scale of deep learning nowadays calls for efficient distributed training
algorithms. Decentralized momentum SGD (DmSGD), in which each node averages
only with its neighbors, is more communication efficient than vanilla Parallel
momentum SGD, which incurs a global average across all computing nodes. On the
other hand, large-batch training has been demonstrated to be critical for
runtime speedup. This motivates us to investigate how DmSGD performs in the
large-batch scenario.
In this work, we find the momentum term can amplify the inconsistency bias in
DmSGD. Such bias becomes more evident as batch-size grows large and hence
results in severe performance degradation. We next propose DecentLaM, a novel
decentralized large-batch momentum SGD to remove the momentum-incurred bias.
The convergence rate for both non-convex and strongly-convex scenarios is
established. Our theoretical results justify the superiority of DecentLaM to
DmSGD especially in the large-batch scenario. Experimental results on a variety
of computer vision tasks and models demonstrate that DecentLaM promises both
efficient and high-quality training.
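To make the recursion concrete, below is a minimal NumPy sketch of a generic decentralized momentum SGD (DmSGD-style) step in which each node averages only with its ring neighbors. The ring mixing matrix, the quadratic toy losses, and all hyperparameters are assumptions chosen for illustration; the paper's exact DmSGD and DecentLaM recursions are given in the paper itself.

```python
import numpy as np

# Toy setup: n nodes, each with its own quadratic loss f_i(x) = 0.5*||x - b_i||^2.
# This is an assumed example, not the paper's experimental setup.
n, dim = 8, 4
rng = np.random.default_rng(0)
b = rng.normal(size=(n, dim))          # per-node optima (models data heterogeneity)

def local_grad(i, x):
    """Gradient of node i's local quadratic loss at x."""
    return x - b[i]

# Symmetric, doubly stochastic mixing matrix for a ring: average with two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

lr, beta, steps = 0.1, 0.9, 200
x = np.zeros((n, dim))                 # one parameter copy per node
m = np.zeros((n, dim))                 # one momentum buffer per node

for _ in range(steps):
    g = np.stack([local_grad(i, x[i]) for i in range(n)])
    m = beta * m + g                   # heavy-ball momentum, kept local to each node
    # Each node averages only with its neighbors (gossip) while taking its momentum step.
    x = W @ (x - lr * m)

print("consensus gap:", np.abs(x - x.mean(axis=0)).max())
print("max per-node distance to the global optimum:", np.abs(x - b.mean(axis=0)).max())
```

With heterogeneous local losses, the per-node iterates settle at points that differ from both each other and the global minimizer; this residual disagreement loosely illustrates the kind of inconsistency bias discussed above, which the paper shows is amplified by momentum in DmSGD and removed in DecentLaM.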
Related papers
- When and Why Momentum Accelerates SGD: An Empirical Study [76.2666927020119]
This study examines the performance of stochastic gradient descent (SGD) with momentum (SGDM).
We find that momentum acceleration is closely related to abrupt sharpening, which describes a sudden jump of the directional Hessian along the update direction.
Momentum improves performance by preventing or deferring the occurrence of abrupt sharpening.
arXiv Detail & Related papers (2023-06-15T09:54:21Z)
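The directional Hessian quantity behind "abrupt sharpening" can be probed with a double-backward pass. Below is a hedged PyTorch sketch that computes the curvature d^T H d along a chosen update direction; the tiny model, random data, and the choice of the normalized negative gradient as the direction are assumptions for illustration, not the study's protocol.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
params = list(model.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep graph for double backward

# Update direction: here the normalized negative gradient; a momentum buffer could be used instead.
d = [-g.detach() for g in grads]
norm = torch.sqrt(sum((di ** 2).sum() for di in d))
d = [di / norm for di in d]

# Hessian-vector product H d via a second backward through the gradient.
gd = sum((g * di).sum() for g, di in zip(grads, d))
Hd = torch.autograd.grad(gd, params)

# Directional curvature d^T H d: a sudden jump of this value during training
# is the "abrupt sharpening" phenomenon described above.
curvature = sum((hdi * di).sum() for hdi, di in zip(Hd, d))
print("directional curvature:", curvature.item())
```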
- Solving Large-scale Spatial Problems with Convolutional Neural Networks [88.31876586547848]
We employ transfer learning to improve training efficiency for large-scale spatial problems.
We propose that a convolutional neural network (CNN) can be trained on small windows of signals, but evaluated on arbitrarily large signals with little to no performance degradation.
arXiv Detail & Related papers (2023-06-14T01:24:42Z)
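The "train on small windows, evaluate on arbitrarily large signals" behaviour is natural for fully convolutional networks, whose layers place no constraint on input length. The sketch below illustrates that mechanism on a made-up 1-D denoising task; the architecture and data are assumptions, not those of the paper.

```python
import torch

torch.manual_seed(0)

# A fully convolutional 1-D network: no flatten/linear layer, so input length is unconstrained.
net = torch.nn.Sequential(
    torch.nn.Conv1d(1, 16, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.Conv1d(16, 1, kernel_size=5, padding=2),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Train on short windows (length 64) of a synthetic denoising task.
for _ in range(200):
    clean = torch.sin(torch.linspace(0, 6.28, 64)).repeat(8, 1, 1)
    noisy = clean + 0.1 * torch.randn_like(clean)
    loss = torch.nn.functional.mse_loss(net(noisy), clean)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate on a much longer signal (length 4096) with the same weights.
with torch.no_grad():
    long_clean = torch.sin(torch.linspace(0, 402.0, 4096)).view(1, 1, -1)
    long_noisy = long_clean + 0.1 * torch.randn_like(long_clean)
    print("long-signal MSE:", torch.nn.functional.mse_loss(net(long_noisy), long_clean).item())
```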
- Fast Diffusion Model [122.36693015093041]
Diffusion models (DMs) have been adopted across diverse fields for their ability to capture intricate data distributions.
In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a DM optimization perspective.
arXiv Detail & Related papers (2023-06-12T09:38:04Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
- Scalable Optimal Margin Distribution Machine [50.281535710689795]
Optimal Margin Distribution Machine (ODM) is a newly proposed statistical learning framework rooted in the novel margin theory.
This paper proposes a scalable ODM, which can achieve nearly ten times speedup compared to the original ODM training method.
arXiv Detail & Related papers (2023-05-08T16:34:04Z)
- Contrastive Weight Regularization for Large Minibatch SGD [8.927483136015283]
We introduce a novel regularization technique, namely distinctive regularization (DReg).
DReg replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse.
We empirically show that optimizing the neural network with DReg using large-batch SGD achieves a significant boost in convergence and improved performance.
arXiv Detail & Related papers (2020-11-17T22:07:38Z)
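As summarized above, DReg replicates a layer and pushes the two copies' parameters apart. The sketch below illustrates that general idea with a generic diversity penalty; how the copies are combined in the forward pass and the cosine-based penalty are assumptions of this sketch, not the paper's exact formulation.

```python
import torch

torch.manual_seed(0)

class ReplicatedLinear(torch.nn.Module):
    """A layer kept in two copies; the forward pass averages their outputs.
    How the copies are combined is an assumption of this sketch."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.a = torch.nn.Linear(d_in, d_out)
        self.b = torch.nn.Linear(d_in, d_out)

    def forward(self, x):
        return 0.5 * (self.a(x) + self.b(x))

    def diversity_penalty(self):
        # Penalize similarity (squared cosine) between the two weight copies,
        # which encourages them to stay diverse.
        wa, wb = self.a.weight.flatten(), self.b.weight.flatten()
        return torch.nn.functional.cosine_similarity(wa, wb, dim=0) ** 2

net = torch.nn.Sequential(ReplicatedLinear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
lam = 0.1                                                     # regularization strength (assumed)

x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))      # toy classification batch
for _ in range(50):
    logits = net(x)
    loss = torch.nn.functional.cross_entropy(logits, y) + lam * net[0].diversity_penalty()
    opt.zero_grad()
    loss.backward()
    opt.step()
```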
- Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
arXiv Detail & Related papers (2020-07-28T04:34:43Z)
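One generic way to combine gradient normalization with heavy-ball momentum, in the spirit of the SNGM summary above, is sketched below. Where the normalization is applied (here, to the raw gradient before it enters the momentum buffer) is an assumption; SNGM's precise update rule is specified in the paper.

```python
import numpy as np

def normalized_momentum_step(w, m, grad, lr=0.05, beta=0.9, eps=1e-12):
    """One step of a generic normalized-gradient method with momentum.
    The gradient is scaled to unit norm before entering the momentum buffer
    (an assumption of this sketch; SNGM's exact update is in the paper)."""
    g_hat = grad / (np.linalg.norm(grad) + eps)
    m = beta * m + g_hat
    w = w - lr * m
    return w, m

# Toy usage on a quadratic loss f(w) = 0.5 * ||w - w_star||^2 with exact gradients.
rng = np.random.default_rng(0)
w_star = rng.normal(size=10)
w, m = np.zeros(10), np.zeros(10)
for _ in range(500):
    w, m = normalized_momentum_step(w, m, grad=w - w_star)
print("final error:", np.linalg.norm(w - w_star))
```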
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations to wait for gradient synchronization.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
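The delayed-averaging idea can be mimicked in a single process by letting each worker fold in a one-step-stale average of the replicas, which is what allows the real implementation to overlap communication with the next forward/backward pass. The sketch below simulates only the staleness; the quadratic toy losses and the one-step delay are assumptions, not DaSGD's exact schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 4, 8, 0.1, 100
b = rng.normal(size=(n_workers, dim))          # per-worker quadratic optima (toy heterogeneity)

x = np.zeros((n_workers, dim))                 # local model replicas
pending_avg = x.mean(axis=0)                   # "in-flight" average from the previous step

for _ in range(steps):
    grads = x - b                              # local gradients of 0.5*||x_i - b_i||^2
    # Apply the average that "arrived" from the previous step (one step stale),
    # then take the local SGD step; in a real run the fresh all-reduce would be
    # launched asynchronously here and overlap the next forward/backward pass.
    x = pending_avg[None, :] - lr * grads
    pending_avg = x.mean(axis=0)               # result of the (delayed) all-reduce

print("consensus gap:", np.abs(x - x.mean(axis=0)).max())
```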
- SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization [22.190763887903085]
We propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of machine learning models over a network.
We show that the convergence rate of SQuARM-SGD matches that of vanilla SGD with momentum updates.
We empirically show that including momentum updates in SQuARM-SGD can lead to better test performance than the current state-of-the-art which does not consider momentum updates.
arXiv Detail & Related papers (2020-05-13T02:11:14Z)
- Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
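Averaging weights within small, rotating subgroups rather than across all workers is the communication saving described above. Below is a single-process simulation of that pattern; the group size, rotation rule, and toy losses are assumptions rather than WAGMA-SGD's actual group schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, group_size = 8, 4, 2
lr, steps = 0.1, 200
b = rng.normal(size=(n_workers, dim))          # heterogeneous per-worker optima (toy)

x = np.zeros((n_workers, dim))
workers = np.arange(n_workers)

for t in range(steps):
    x -= lr * (x - b)                          # local SGD step on each worker's quadratic loss
    # Partition workers into groups of `group_size`, rotating membership each step,
    # and average only within each group (cheaper than a global all-reduce).
    order = np.roll(workers, t % n_workers)
    for g in range(0, n_workers, group_size):
        group = order[g:g + group_size]
        x[group] = x[group].mean(axis=0)

print("consensus gap after rotating group averaging:", np.abs(x - x.mean(axis=0)).max())
```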
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
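The "early momentum" (SGDEM) idea can be sketched as heavy-ball SGD whose momentum coefficient is switched off after an initial phase. The cutoff epoch and the noisy quadratic problem below are assumptions for illustration; the paper analyzes the schedule and step-size conditions formally.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = rng.normal(size=20)
w, m = np.zeros(20), np.zeros(20)
epochs, steps_per_epoch, lr = 30, 50, 0.02
early_phase = 10                               # momentum used only for the first 10 epochs (assumed cutoff)

for epoch in range(epochs):
    beta = 0.9 if epoch < early_phase else 0.0
    for _ in range(steps_per_epoch):
        grad = (w - w_star) + 0.1 * rng.normal(size=20)   # noisy gradient of a quadratic loss
        m = beta * m + grad                               # heavy-ball buffer (plain SGD once beta = 0)
        w = w - lr * m

print("final error:", np.linalg.norm(w - w_star))
```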