Asynchronous Training Schemes in Distributed Learning with Time Delay
- URL: http://arxiv.org/abs/2208.13154v1
- Date: Sun, 28 Aug 2022 07:14:59 GMT
- Title: Asynchronous Training Schemes in Distributed Learning with Time Delay
- Authors: Haoxiang Wang, Zhanhong Jiang, Chao Liu, Soumik Sarkar, Dongxiang
Jiang, Young M. Lee
- Abstract summary: In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance.
In this paper, we propose a different approach to tackle the issue of stale weights or gradients.
One practical variant of PC-ASGD is also proposed by adopting a condition to help with the determination of the tradeoff parameter.
- Score: 17.259708772713164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of distributed deep learning, the issue of stale weights or
gradients could result in poor algorithmic performance. This issue is usually
tackled by delay tolerant algorithms with some mild assumptions on the
objective functions and step sizes. In this paper, we propose a different
approach to develop a new algorithm, called $\textbf{P}$redicting
$\textbf{C}$lipping $\textbf{A}$synchronous $\textbf{S}$tochastic
$\textbf{G}$radient $\textbf{D}$escent (aka, PC-ASGD). Specifically, PC-ASGD
has two steps - the $\textit{predicting step}$ leverages the gradient
prediction using Taylor expansion to reduce the staleness of the outdated
weights while the $\textit{clipping step}$ selectively drops the outdated
weights to alleviate their negative effects. A tradeoff parameter is introduced
to balance the effects between these two steps. Theoretically, we present the
convergence rate considering the effects of delay of the proposed algorithm
with constant step size when the smooth objective functions are weakly
strongly-convex and nonconvex. One practical variant of PC-ASGD is also
proposed by adopting a condition to help with the determination of the tradeoff
parameter. For empirical validation, we demonstrate the performance of the
algorithm with two deep neural network architectures on two benchmark datasets.
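To make the two steps concrete, here is a minimal, illustrative sketch of how a PC-ASGD-style update could be organized on a toy problem. The helper names (`pc_asgd_step`, `grad`), the toy quadratic objectives, the first-order form of the Taylor-based prediction, and the way the tradeoff parameter `theta` blends the predicting and clipping consensus estimates are illustrative assumptions, not the paper's exact update rules.

```python
# Illustrative sketch only: a PC-ASGD-style update on a toy quadratic problem.
# The exact update rules, notation, and schedules in the paper may differ.
import numpy as np

rng = np.random.default_rng(0)
dim, n_agents, delay = 5, 4, 3                 # toy dimension, agent count, fixed delay

# Each agent i holds a local quadratic f_i(w) = 0.5 * w^T A_i w - b_i^T w.
A = [np.diag(rng.uniform(0.5, 2.0, dim)) for _ in range(n_agents)]
b = [rng.normal(size=dim) for _ in range(n_agents)]

def grad(i, w):
    """Noisy gradient of agent i's local objective (stand-in for a minibatch gradient)."""
    return A[i] @ w - b[i] + 0.01 * rng.normal(size=dim)

def pc_asgd_step(i, fresh_w, stale_w, stale_g, eta=0.05, theta=0.5):
    """One PC-ASGD-style update for agent i.

    fresh_w -- weights of non-delayed neighbors (dict: agent id -> weight vector)
    stale_w -- last received (outdated) weights of delayed neighbors
    stale_g -- gradients the delayed neighbors reported with those stale weights
    """
    # Predicting step: extrapolate each delayed neighbor's weights forward with a
    # first-order, Taylor-expansion-style correction based on its stale gradient.
    predicted = {j: stale_w[j] - eta * delay * stale_g[j] for j in stale_w}
    consensus_pred = np.mean([*fresh_w.values(), *predicted.values()], axis=0)

    # Clipping step: drop the delayed neighbors entirely and average the fresh weights.
    consensus_clip = np.mean(list(fresh_w.values()), axis=0)

    # The tradeoff parameter theta balances the two steps; a local gradient step
    # is then taken from the blended consensus point.
    blended = theta * consensus_pred + (1.0 - theta) * consensus_clip
    return blended - eta * grad(i, blended)

# Toy usage: agent 0 sees agents 1 and 2 fresh, while agent 3 is delayed.
w = {j: rng.normal(size=dim) for j in range(n_agents)}
fresh = {j: w[j] for j in (0, 1, 2)}
stale = {3: w[3]}
stale_g = {3: grad(3, w[3])}
w[0] = pc_asgd_step(0, fresh, stale, stale_g)
```

In the paper, the practical variant additionally uses a condition to help set the tradeoff parameter; here `theta` is simply a fixed constant for illustration.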
Related papers
- Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates the delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient in the order of its arrival.
We then develop an improved algorithm, which reduces the dynamic regret bound achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
arXiv Detail & Related papers (2023-05-20T07:54:07Z) - Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We establish finite-time error bounds showing that the Greedy-GQ algorithm converges quickly.
Our analysis shows how to choose step-sizes for faster convergence.
arXiv Detail & Related papers (2022-09-06T15:04:57Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation with workers that communicate at varying frequencies.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Momentum-Based Policy Gradient with Second-Order Information [40.51117836892182]
We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into gradient descent.
Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the advantage of the variance reduction process.
Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
arXiv Detail & Related papers (2022-05-17T11:56:50Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale gradient computed at step $t - d_t$, for an arbitrary delay $d_t$ (a minimal sketch of this update rule appears at the end of this list).
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm, the Momentum-Assisted Single-Timescale Stochastic Approximation (MSTSA) algorithm, for tackling bilevel optimization problems.
MSTSA allows us to control the error in its iterations caused by inexact solutions to the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z) - Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
We study adversary-resilient distributed optimization, in which machines can independently compute gradients and cooperate to jointly optimize their objective.
Our algorithm is based on a new concentration technique, and we analyze its sample complexity.
It is very practical: it improves upon the performance of all prior methods when no Byzantine machines are present.
arXiv Detail & Related papers (2020-12-28T17:19:32Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point per step.
Our results are expressed in the form of simultaneous primal-side and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
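Several of the delay-tolerant methods listed above (e.g., DOGD and the arbitrary-delay analysis) share the same basic setting: the update at step $t$ uses a gradient evaluated at the stale iterate from step $t - d_t$. The following minimal sketch shows that delayed update loop on a toy objective; the random delays, step size, and objective are illustrative assumptions, not taken from any of the papers above.

```python
# Minimal sketch of SGD with delayed gradients: at step t the gradient is
# evaluated at the stale iterate x_{t - d_t}. Delays, step size, and the toy
# objective are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)

def grad(x):
    """Noisy gradient of f(x) = 0.5 * ||x||^2."""
    return x + 0.1 * rng.normal(size=x.shape)

T, eta = 200, 0.1
x = rng.normal(size=3)
iterates = [x.copy()]                      # stores x_0, x_1, ..., x_t
delays = rng.integers(0, 5, size=T)        # arbitrary (here random) delays d_t

for t in range(T):
    d_t = int(min(delays[t], t))           # cannot look back before x_0
    stale_x = iterates[t - d_t]            # iterate from step t - d_t
    x = x - eta * grad(stale_x)            # update with the stale gradient
    iterates.append(x.copy())

print("final distance to optimum:", np.linalg.norm(x))
```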