Distributed stochastic optimization with large delays
- URL: http://arxiv.org/abs/2107.02919v1
- Date: Tue, 6 Jul 2021 21:59:49 GMT
- Title: Distributed stochastic optimization with large delays
- Authors: Zhengyuan Zhou and Panayotis Mertikopoulos and Nicholas Bambos and
Peter W. Glynn and Yinyu Ye
- Abstract summary: One of the most widely used methods for solving large-scale stochastic optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum with probability $1$ even when the delays grow unbounded at a polynomial rate.
- Score: 59.95552973784946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most widely used methods for solving large-scale stochastic
optimization problems is distributed asynchronous stochastic gradient descent
(DASGD), a family of algorithms that result from parallelizing stochastic
gradient descent on distributed computing architectures (possibly
asynchronously). However, a key obstacle in the efficient implementation of DASGD
is the issue of delays: when a computing node contributes a gradient update,
the global model parameter may have already been updated by other nodes several
times over, thereby rendering this gradient information stale. These delays can
quickly add up if the computational throughput of a node is saturated, so the
convergence of DASGD may be compromised in the presence of large delays. Our
first contribution is that, by carefully tuning the algorithm's step-size,
convergence to the critical set is still achieved in mean square, even if the
delays grow unbounded at a polynomial rate. We also establish finer results in
a broad class of structured optimization problems (called variationally
coherent), where we show that DASGD converges to a global optimum with
probability $1$ under the same delay assumptions. Together, these results
contribute to the broad landscape of large-scale non-convex stochastic
optimization by offering state-of-the-art theoretical guarantees and providing
insights for algorithm design.
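The abstract's core mechanism can be illustrated with a toy, single-variable simulation (not the paper's algorithm or analysis): each update applies a gradient computed at a stale iterate, and a polynomially decaying step-size absorbs the staleness. The objective, step-size schedule, noise level, and delay model below are all illustrative assumptions.

```python
import random

def dasgd_simulation(grad, x0, steps=2000, max_delay=10, seed=0):
    """Toy DASGD sketch: every update uses a gradient evaluated at a
    stale iterate drawn from the recent history, mimicking asynchrony."""
    rng = random.Random(seed)
    history = [x0]
    x = x0
    for t in range(steps):
        delay = rng.randint(0, min(max_delay, t))   # staleness of this update
        stale_x = history[-1 - delay]               # iterate the worker read
        eta = 0.1 / (1 + t) ** 0.6                  # decaying step-size (illustrative)
        noise = rng.gauss(0.0, 0.05)                # stochastic gradient noise
        x = x - eta * (grad(stale_x) + noise)       # update with the stale gradient
        history.append(x)
    return x

# quadratic objective f(x) = x^2, gradient 2x, optimum at 0
x_final = dasgd_simulation(lambda x: 2.0 * x, x0=5.0)
```

Despite every step using a stale gradient, the iterate settles near the optimum because the step-size shrinks faster than the staleness can destabilize the recursion.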
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD using stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Towards Understanding the Generalizability of Delayed Stochastic
Gradient Descent [63.43247232708004]
A gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models.
Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization.
Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z) - Min-Max Optimization under Delays [26.830212508878162]
Delays and asynchrony are inevitable in large-scale machine-learning problems.
No analogous theory is available for min-max optimization.
We show that even small delays can cause prominent algorithms like Extra-gradient to diverge.
arXiv Detail & Related papers (2023-07-13T16:39:01Z) - Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates the delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order.
We develop an improved algorithm, which reduces those dynamic regret bounds achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
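The DOGD idea described above can be sketched as follows: queue each gradient by the round at which it arrives, and perform one descent step per gradient in arrival order. This is a minimal, unprojected sketch with scalar gradients supplied as plain numbers; the delay pattern and step-size are illustrative assumptions, not the paper's setting.

```python
def dogd(gradients, delays, x0, eta):
    """Sketch of DOGD: apply one gradient-descent step per delayed
    gradient, processed strictly in arrival order."""
    T = len(gradients)
    arrivals = {}  # round -> gradients that arrive at that round
    for t, (g, d) in enumerate(zip(gradients, delays)):
        arrivals.setdefault(t + d, []).append(g)
    decisions = []
    x = x0
    horizon = max(t + d for t, d in enumerate(delays)) + 1
    for t in range(horizon):
        if t < T:
            decisions.append(x)          # decision played at round t
        for g in arrivals.get(t, []):    # gradients arriving now
            x = x - eta * g              # one step per arrived gradient
    return decisions

# round 0's gradient is delayed by 2 rounds; rounds 1 and 2 arrive on time
decisions = dogd([1.0, 1.0, 1.0], [2, 0, 0], x0=0.0, eta=0.5)
```

Because the round-0 gradient only arrives at round 2, the decision at round 1 is unchanged, while later decisions absorb the backlog in arrival order.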
arXiv Detail & Related papers (2023-05-20T07:54:07Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
Stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade.
Some modified SGD-type algorithms outperform SGD in many competitions and applications in terms of convergence rate and accuracy, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad).
We focus on the convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - Guided parallelized stochastic gradient descent for delay compensation [0.0]
The stochastic gradient descent (SGD) algorithm and its variations have been effectively used to optimize neural network models.
With the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function.
This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD) to train deep neural networks.
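The ASGD/SSGD distinction mentioned above comes down to when worker gradients are applied. A minimal sketch of the two update rules (scalar parameters and pre-computed gradients, purely for illustration):

```python
def ssgd_step(x, worker_grads, eta):
    """Synchronous SGD (SSGD): wait for every worker, average their
    gradients, then apply a single update."""
    avg = sum(worker_grads) / len(worker_grads)
    return x - eta * avg

def asgd_steps(x, worker_grads, eta):
    """Asynchronous SGD (ASGD): apply each worker's gradient as soon
    as it arrives, one update per worker."""
    for g in worker_grads:
        x = x - eta * g
    return x

x_sync = ssgd_step(1.0, [2.0, 4.0], eta=0.1)    # one averaged update
x_async = asgd_steps(1.0, [2.0, 4.0], eta=0.1)  # two sequential updates
```

SSGD pays a synchronization barrier for a single well-conditioned step, while ASGD trades that barrier for more frequent, potentially stale, updates.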
arXiv Detail & Related papers (2021-01-17T23:12:40Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Stochastic Gradient Langevin with Delayed Gradients [29.6870062491741]
We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation, suggesting significant potential for speedup in wall clock time.
arXiv Detail & Related papers (2020-06-12T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.