Guided parallelized stochastic gradient descent for delay compensation
- URL: http://arxiv.org/abs/2101.07259v1
- Date: Sun, 17 Jan 2021 23:12:40 GMT
- Title: Guided parallelized stochastic gradient descent for delay compensation
- Authors: Anuraganand Sharma
- Abstract summary: The stochastic gradient descent (SGD) algorithm and its variations have been effectively used to optimize neural network models.
With the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function.
This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD) to train deep neural networks.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The stochastic gradient descent (SGD) algorithm and its variations have been used effectively to optimize neural network models. However, with the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function. This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD), to train deep neural networks. However, these parallel variants introduce high variance due to the delay in parameter (weight) updates. We address this delay in our proposed algorithm and try to minimize its impact. We employ guided SGD (gSGD), which encourages consistent examples to steer the convergence by compensating for the unpredictable deviation caused by the delay. Its convergence rate is similar to that of A/SSGD; however, some additional (parallel) processing is required to compensate for the delay.
The experimental results demonstrate that our proposed approach is able to mitigate the impact of delay on classification accuracy. The guided approach with SSGD clearly outperforms the unguided parallel variants and, for some benchmark datasets, even achieves accuracy close to that of sequential SGD.
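To make the delay problem concrete, the following is a minimal sketch of parallel-style SGD in which each applied gradient was computed on weights that are several steps old. The `guidance` factor is only a hypothetical stand-in for gSGD's consistency-based compensation (the paper defines the actual rule); the toy regression task, learning rate, and delay length are likewise assumptions for illustration.

```python
# Sketch of the staleness problem in parallel SGD: the gradient applied at
# step t was computed on weights from `delay` steps earlier. The `guidance`
# factor is a hypothetical stand-in for gSGD-style compensation, not the
# paper's actual update rule.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem.
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

def grad(w, idx):
    """Mini-batch gradient of the mean squared error of the linear model."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(10)
lr, delay, steps = 0.05, 4, 300
history = [w.copy()]        # past weights, used to simulate staleness
prev_g = np.zeros_like(w)   # previous gradient, used by the "guidance" check

for t in range(steps):
    stale_w = history[max(0, t - delay)]              # weights a worker saw
    idx = rng.choice(len(y), size=32, replace=False)
    g = grad(stale_w, idx)                            # stale mini-batch gradient

    # Hypothetical guidance: shrink the step when the stale gradient's
    # direction disagrees with the previous one (a crude consistency check).
    guidance = 0.5 if np.dot(g, prev_g) < 0 else 1.0

    w = w - lr * guidance * g
    prev_g = g
    history.append(w.copy())

print("final training loss:", np.mean((X @ w - y) ** 2))
```

Larger delays make the applied gradients noisier relative to the current weights, which is the variance effect the abstract attributes to ASGD/SSGD; the compensation above only illustrates where a guidance term would enter the update.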
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and the approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z)
- Towards Understanding the Generalizability of Delayed Stochastic Gradient Descent [63.43247232708004]
Stochastic gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models.
Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization.
Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z)
- DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to 10%.
arXiv Detail & Related papers (2022-08-29T18:01:42Z)
- Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays [8.46491234455848]
We prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices.
For our analysis, we introduce a novel recursion based on "virtual iterates" and delay-adaptive stepsizes, which allows us to derive state-of-the-art guarantees for both convex and non-convex objectives.
arXiv Detail & Related papers (2022-06-15T16:28:37Z)
- Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under mild assumptions on the delays.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient; a rough sketch of this idea appears after this list.
Our method outperforms previous adaptive learning-rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows a better convergence bound than error feedback for non-convex problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- Slow and Stale Gradients Can Win the Race [39.750046808758526]
Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers).
Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect the convergence error.
We present a novel theoretical characterization of the speedup offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime.
arXiv Detail & Related papers (2020-03-23T23:27:50Z)
- Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent [32.40217829362088]
We propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training deep neural networks (DNNs).
SRSGD replaces the constant momentum in SGD with the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule (a sketch of this idea appears after this list).
On both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
arXiv Detail & Related papers (2020-02-24T23:16:19Z)
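As referenced in the AdaRem entry above, the following is a minimal, hypothetical sketch of the direction-alignment idea: a per-parameter learning rate is scaled up when the parameter's past movement agrees in sign with the current descent direction and scaled down otherwise. The function name `adarem_step`, the decay `beta`, the scaling factor `eta`, and the toy quadratic are assumptions, not the paper's exact rule or hyperparameters.

```python
# Hypothetical per-parameter learning-rate adjustment in the spirit of AdaRem:
# not the paper's exact update rule, only the alignment idea it describes.
import numpy as np

def adarem_step(w, grad, state, lr=0.05, beta=0.9, eta=0.5):
    """One update step; `state` keeps an exponential average of past updates."""
    m = state.get("m", np.zeros_like(w))
    # Alignment test: did this parameter move in the same direction that the
    # current (negative) gradient would push it?
    aligned = np.sign(m) == np.sign(-grad)
    # Raise the per-parameter learning rate when aligned, lower it otherwise.
    scale = np.where(aligned, 1.0 + eta, 1.0 - eta)
    update = -lr * scale * grad
    state["m"] = beta * m + (1.0 - beta) * update
    return w + update, state

# Toy usage: minimize 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0, 3.0])
w, state = np.zeros(3), {}
for _ in range(200):
    w, state = adarem_step(w, w - target, state)
print("w:", np.round(w, 3))
```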
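Likewise, as referenced in the SRSGD entry above, this is a small sketch of scheduled-restart Nesterov-style momentum: the momentum coefficient grows over iterations and is periodically reset to zero. The restart interval, the k/(k+3) momentum sequence as written here, and the toy problem are illustrative assumptions rather than the authors' exact schedule.

```python
# Illustrative scheduled-restart NAG-style momentum: the momentum coefficient
# grows with the iteration count and is reset to zero every `restart` steps.
import numpy as np

def srsgd_like(grad_fn, w, lr=0.1, restart=40, steps=200):
    y_prev = w.copy()             # previous auxiliary iterate for the NAG step
    for t in range(steps):
        k = t % restart           # iterations since the last scheduled restart
        mu = k / (k + 3.0)        # increasing momentum, zero right after a restart
        y = w - lr * grad_fn(w)   # gradient step
        w = y + mu * (y - y_prev) # Nesterov momentum correction
        y_prev = y
    return w

# Toy usage: minimize 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0, 3.0])
w = srsgd_like(lambda w: w - target, np.zeros(3))
print("w:", np.round(w, 3))
```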
This list is automatically generated from the titles and abstracts of the papers on this site.