Distributed stochastic optimization with large delays
- URL: http://arxiv.org/abs/2107.02919v1
- Date: Tue, 6 Jul 2021 21:59:49 GMT
- Title: Distributed stochastic optimization with large delays
- Authors: Zhengyuan Zhou and Panayotis Mertikopoulos and Nicholas Bambos and
Peter W. Glynn and Yinyu Ye
- Abstract summary: One of the most widely used methods for solving large-scale stochastic optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum with probability $1$ even when the delays grow unbounded at a polynomial rate.
- Score: 59.95552973784946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most widely used methods for solving large-scale stochastic
optimization problems is distributed asynchronous stochastic gradient descent
(DASGD), a family of algorithms that result from parallelizing stochastic
gradient descent on distributed computing architectures (possibly
asynchronously). However, a key obstacle in the efficient implementation of DASGD
is the issue of delays: when a computing node contributes a gradient update,
the global model parameter may have already been updated by other nodes several
times over, thereby rendering this gradient information stale. These delays can
quickly add up if the computational throughput of a node is saturated, so the
convergence of DASGD may be compromised in the presence of large delays. Our
first contribution is that, by carefully tuning the algorithm's step-size,
convergence to the critical set is still achieved in mean square, even if the
delays grow unbounded at a polynomial rate. We also establish finer results in
a broad class of structured optimization problems (called variationally
coherent), where we show that DASGD converges to a global optimum with
probability $1$ under the same delay assumptions. Together, these results
contribute to the broad landscape of large-scale non-convex stochastic
optimization by offering state-of-the-art theoretical guarantees and providing
insights for algorithm design.
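The abstract's core mechanism can be illustrated with a toy, single-variable simulation (not the paper's algorithm or analysis): each update applies a gradient computed at a stale iterate, and a polynomially decaying step-size absorbs the staleness. The objective, step-size schedule, noise level, and delay model below are all illustrative assumptions.

```python
import random

def dasgd_simulation(grad, x0, steps=2000, max_delay=10, seed=0):
    """Toy DASGD sketch: every update uses a gradient evaluated at a
    stale iterate drawn from the recent history, mimicking asynchrony."""
    rng = random.Random(seed)
    history = [x0]
    x = x0
    for t in range(steps):
        delay = rng.randint(0, min(max_delay, t))   # staleness of this update
        stale_x = history[-1 - delay]               # iterate the worker read
        eta = 0.1 / (1 + t) ** 0.6                  # decaying step-size (illustrative)
        noise = rng.gauss(0.0, 0.05)                # stochastic gradient noise
        x = x - eta * (grad(stale_x) + noise)       # update with the stale gradient
        history.append(x)
    return x

# quadratic objective f(x) = x^2, gradient 2x, optimum at 0
x_final = dasgd_simulation(lambda x: 2.0 * x, x0=5.0)
```

Despite every step using a stale gradient, the iterate settles near the optimum because the step-size shrinks faster than the staleness can destabilize the recursion.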
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD using stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Towards Understanding the Generalizability of Delayed Stochastic
Gradient Descent [63.43247232708004]
A gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models.
Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization.
Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z) - Min-Max Optimization under Delays [26.830212508878162]
Delays and asynchrony are inevitable in large-scale machine-learning problems.
No analogous theory is available for min-max optimization.
We show that even small delays can cause prominent algorithms like Extra-gradient to diverge.
arXiv Detail & Related papers (2023-07-13T16:39:01Z) - Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates the delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order.
We develop an improved algorithm, which reduces those dynamic regret bounds achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
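The DOGD idea described above can be sketched as follows: queue each gradient by the round at which it arrives, and perform one descent step per gradient in arrival order. This is a minimal, unprojected sketch with scalar gradients supplied as plain numbers; the delay pattern and step-size are illustrative assumptions, not the paper's setting.

```python
def dogd(gradients, delays, x0, eta):
    """Sketch of DOGD: apply one gradient-descent step per delayed
    gradient, processed strictly in arrival order."""
    T = len(gradients)
    arrivals = {}  # round -> gradients that arrive at that round
    for t, (g, d) in enumerate(zip(gradients, delays)):
        arrivals.setdefault(t + d, []).append(g)
    decisions = []
    x = x0
    horizon = max(t + d for t, d in enumerate(delays)) + 1
    for t in range(horizon):
        if t < T:
            decisions.append(x)          # decision played at round t
        for g in arrivals.get(t, []):    # gradients arriving now
            x = x - eta * g              # one step per arrived gradient
    return decisions

# round 0's gradient is delayed by 2 rounds; rounds 1 and 2 arrive on time
decisions = dogd([1.0, 1.0, 1.0], [2, 0, 0], x0=0.0, eta=0.5)
```

Because the round-0 gradient only arrives at round 2, the decision at round 1 is unchanged, while later decisions absorb the backlog in arrival order.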
arXiv Detail & Related papers (2023-05-20T07:54:07Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
Stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade.
Some modified SGD-type algorithms outperform SGD in many competitions and applications in terms of convergence rate and accuracy, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad).
We focus on the convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - Guided parallelized stochastic gradient descent for delay compensation [0.0]
The stochastic gradient descent (SGD) algorithm and its variations have been effectively used to optimize neural network models.
With the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function.
This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD) to train deep neural networks.
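The ASGD/SSGD distinction mentioned above comes down to when worker gradients are applied. A minimal sketch of the two update rules (scalar parameters and pre-computed gradients, purely for illustration):

```python
def ssgd_step(x, worker_grads, eta):
    """Synchronous SGD (SSGD): wait for every worker, average their
    gradients, then apply a single update."""
    avg = sum(worker_grads) / len(worker_grads)
    return x - eta * avg

def asgd_steps(x, worker_grads, eta):
    """Asynchronous SGD (ASGD): apply each worker's gradient as soon
    as it arrives, one update per worker."""
    for g in worker_grads:
        x = x - eta * g
    return x

x_sync = ssgd_step(1.0, [2.0, 4.0], eta=0.1)    # one averaged update
x_async = asgd_steps(1.0, [2.0, 4.0], eta=0.1)  # two sequential updates
```

SSGD pays a synchronization barrier for a single well-conditioned step, while ASGD trades that barrier for more frequent, potentially stale, updates.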
arXiv Detail & Related papers (2021-01-17T23:12:40Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Stochastic Gradient Langevin with Delayed Gradients [29.6870062491741]
We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation, suggesting significant potential for speedup in wall clock time.
arXiv Detail & Related papers (2020-06-12T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.