Stochastic Gradient Langevin with Delayed Gradients
- URL: http://arxiv.org/abs/2006.07362v1
- Date: Fri, 12 Jun 2020 17:51:30 GMT
- Title: Stochastic Gradient Langevin with Delayed Gradients
- Authors: Vyacheslav Kungurtsev, Bapi Chatterjee, Dan Alistarh
- Abstract summary: We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation, suggesting significant potential for speedup in wall clock time.
- Score: 29.6870062491741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees with
regard to convergence in measure for sampling log-concave posterior
distributions by adding noise to stochastic gradient iterates. Given the size
of many practical problems, parallelizing across several asynchronously running
processors is a popular strategy for reducing the end-to-end computation time
of stochastic optimization algorithms. In this paper, we are the first to
investigate the effect of asynchronous computation, in particular, the
evaluation of stochastic Langevin gradients at delayed iterates, on the
convergence in measure. For this, we exploit recent results modeling Langevin
dynamics as solving a convex optimization problem on the space of measures. We
show that the rate of convergence in measure is not significantly affected by
the error caused by the delayed gradient information used for computation,
suggesting significant potential for speedup in wall clock time. We confirm our
theoretical results with numerical experiments on some practical problems.
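The update analyzed in the paper is the usual SGLD step, except that the stochastic gradient is evaluated at a stale iterate supplied by an asynchronous worker. Below is a minimal sketch of such a delayed-gradient SGLD loop; the quadratic log-concave target, the bounded random delay model, and the step-size constants are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly log-concave target: posterior ~ exp(-f),
# with f(theta) = 0.5 * theta^T A theta.
A = np.array([[2.0, 0.3], [0.3, 1.0]])

def stoch_grad(theta):
    """Noisy (stochastic) gradient of f at theta."""
    return A @ theta + 0.1 * rng.standard_normal(theta.shape)

def delayed_sgld(theta0, eta=0.01, n_steps=5000, max_delay=10):
    """SGLD where each gradient is computed at an iterate up to
    `max_delay` steps old, mimicking asynchronous evaluation."""
    history = [theta0.copy()]
    theta = theta0.copy()
    for k in range(n_steps):
        tau = rng.integers(0, min(k, max_delay) + 1)     # random staleness
        stale = history[-1 - tau]                        # delayed iterate
        noise = np.sqrt(2.0 * eta) * rng.standard_normal(theta.shape)
        theta = theta - eta * stoch_grad(stale) + noise  # Langevin step
        history.append(theta.copy())
    return np.array(history)

samples = delayed_sgld(np.array([3.0, -3.0]))
print("empirical mean:", samples[1000:].mean(axis=0))    # ~ (0, 0) here
```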
Related papers
- Posterior Sampling with Delayed Feedback for Reinforcement Learning with
Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
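The gradient-based approximate sampling scheme mentioned above replaces an exact posterior draw with a few unadjusted Langevin steps on the potential. A minimal sketch of that idea follows; the Gaussian potential, step size, and step count are placeholder assumptions rather than the Delayed-LPSVI specifics.

```python
import numpy as np

rng = np.random.default_rng(1)

def langevin_sample(grad_potential, theta, eta=1e-2, n_steps=200):
    """Approximate one draw from exp(-U) by unadjusted Langevin
    dynamics started at `theta` (no Metropolis correction)."""
    for _ in range(n_steps):
        theta = (theta - eta * grad_potential(theta)
                 + np.sqrt(2.0 * eta) * rng.standard_normal(theta.shape))
    return theta

# Placeholder potential U(theta) = 0.5 * ||theta - mu||^2, i.e. posterior N(mu, I).
mu = np.array([1.0, -2.0])
print(langevin_sample(lambda th: th - mu, np.zeros(2)))  # roughly ~ N(mu, I)
```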
arXiv Detail & Related papers (2023-10-29T06:12:43Z) - Towards Understanding the Generalizability of Delayed Stochastic
Gradient Descent [63.43247232708004]
Stochastic gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models.
Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization.
Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z) - Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm.
We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions.
In addition, we are able to combine the ordering with mini-batch and further apply it to more complex neural networks.
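Non-replacement SGD visits every sample exactly once per epoch, so the visiting order is itself a design choice. The sketch below contrasts a fixed ordering with fresh per-epoch shuffles on a toy least-squares problem; the objective and the two orderings are illustrative assumptions, not the optimal orderings derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + 0.1 * rng.standard_normal(100)

def non_replacement_sgd(order_fn, eta=0.01, epochs=20):
    """One pass per epoch over all samples, visited in the order
    returned by `order_fn(epoch)` (no sample repeated within an epoch)."""
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        for i in order_fn(epoch):
            g = (X[i] @ w - y[i]) * X[i]     # per-sample least-squares gradient
            w -= eta * g
    return w

n = len(y)
w_fixed = non_replacement_sgd(lambda e: np.arange(n))           # one fixed order
w_shuffled = non_replacement_sgd(lambda e: rng.permutation(n))  # fresh shuffle
print(np.linalg.norm(X @ w_fixed - y), np.linalg.norm(X @ w_shuffled - y))
```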
arXiv Detail & Related papers (2023-06-28T00:46:58Z) - Reweighted Interacting Langevin Diffusions: an Accelerated Sampling
Method for Optimization [28.25662317591378]
We propose a new technique to accelerate sampling methods for solving difficult optimization problems.
Our method investigates the connection between posterior distribution sampling and optimization with Langevin dynamics.
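The connection is that, at low temperature, Langevin samples concentrate near the minimizers of the objective, and running several reweighted copies in parallel speeds up that concentration. The sketch below runs an ensemble of Langevin particles that is periodically resampled toward low loss; the toy loss, temperature, and resampling rule are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):   # non-convex toy loss; tilted so the global minimum is near x = -1
    return (x**2 - 1.0) ** 2 + 0.3 * x

def grad_f(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3

def reweighted_langevin(n_particles=64, eta=1e-3, beta=20.0, n_steps=2000):
    """Ensemble of Langevin diffusions targeting exp(-beta * f); every
    100 steps the particles are resampled with weights exp(-beta * f),
    focusing the ensemble on low-loss regions."""
    x = rng.uniform(-2.0, 2.0, n_particles)
    for k in range(n_steps):
        x = (x - eta * beta * grad_f(x)
             + np.sqrt(2.0 * eta) * rng.standard_normal(n_particles))
        if (k + 1) % 100 == 0:
            w = np.exp(-beta * (f(x) - f(x).min()))    # importance weights
            x = rng.choice(x, size=n_particles, p=w / w.sum())
    return x

particles = reweighted_langevin()
print("best particle:", particles[np.argmin(f(particles))])   # near -1
```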
arXiv Detail & Related papers (2023-01-30T03:48:20Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin Dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
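The variance-reduced variant corrects each stochastic gradient with an SVRG-style control variate anchored at a periodic snapshot, leaving the estimate unbiased but with smaller variance. A minimal sketch on a finite-sum toy loss follows; the loss, epoch length, and step size are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.standard_normal(200) + 1.5   # toy data; f_i(th) = 0.5 * (th - x_i)^2

def grad_i(theta, i):
    return theta - data[i]

def full_grad(theta):
    return theta - data.mean()

def svrg_langevin(theta=0.0, eta=1e-2, n_epochs=20, epoch_len=50):
    """Variance-reduced Langevin dynamics: each stochastic gradient is
    corrected by grad_i(anchor) and full_grad(anchor), which keeps the
    estimate unbiased while shrinking its variance."""
    for _ in range(n_epochs):
        anchor, g_full = theta, full_grad(theta)        # per-epoch snapshot
        for _ in range(epoch_len):
            i = rng.integers(len(data))
            g = grad_i(theta, i) - grad_i(anchor, i) + g_full
            theta = theta - eta * g + np.sqrt(2.0 * eta) * rng.standard_normal()
    return theta

print(svrg_langevin())  # fluctuates around the posterior mode, ~ data.mean()
```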
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - A Continuous-time Stochastic Gradient Descent Method for Continuous Data [0.0]
We study a continuous-time variant of the gradient descent algorithm for optimization problems with continuous data.
We study multiple sampling patterns for the continuous data space and allow for data to be simulated or streamed at runtime.
We end with illustrating the applicability of the gradient process in a regression problem with noisy functional data, as well as in a physics-informed neural network.
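In the continuous-time picture, the SGD recursion becomes a gradient flow whose active data point is itself a stochastic process that switches at random times; an Euler discretization with runtime-sampled data recovers an implementable method. The sketch below is one such discretization; the regression target, switching rate, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def grad_sample(theta, s):
    """Gradient of the loss at a single continuous data point s, for
    the toy model y(s) = sin(s) fitted by a constant theta."""
    return theta - np.sin(s)

def gradient_process(theta=0.0, dt=1e-3, T=50.0, switch_rate=5.0):
    """Euler discretization of a continuous-time gradient flow whose
    active data point s is redrawn at exponential switching times,
    i.e. the data is simulated/streamed at runtime."""
    t = 0.0
    s = rng.uniform(0.0, 2.0 * np.pi)
    next_switch = rng.exponential(1.0 / switch_rate)
    while t < T:
        theta -= dt * grad_sample(theta, s)
        t += dt
        if t >= next_switch:                 # redraw the data point
            s = rng.uniform(0.0, 2.0 * np.pi)
            next_switch += rng.exponential(1.0 / switch_rate)
    return theta

print(gradient_process())  # fluctuates around the mean of sin on [0, 2*pi], i.e. 0
```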
arXiv Detail & Related papers (2021-12-07T15:09:24Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
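Clipping here means rescaling each streaming gradient onto a fixed norm ball before the update, which keeps heavy-tailed gradient noise from derailing the iterate. The sketch below estimates a mean from a heavy-tailed stream with clipped SGD; the Pareto noise, clip level, and square-loss objective are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def clip(g, c):
    """Rescale g so its norm is at most c (identity if already small)."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def clipped_streaming_mean(dim=5, n=20000, eta=0.05, c=2.0):
    """Estimate a mean from a stream of heavy-tailed samples by SGD on
    0.5 * ||theta - x||^2 with per-step gradient clipping."""
    true_mean = np.ones(dim)
    theta = np.zeros(dim)
    for t in range(1, n + 1):
        # Lomax(a) noise has mean 1/(a-1); subtract it to center the sample.
        x = true_mean + rng.pareto(2.1, dim) - 1.0 / 1.1
        g = theta - x                        # gradient of the square loss
        theta -= (eta / np.sqrt(t)) * clip(g, c)
    return theta

print(clipped_streaming_mean())  # close to the all-ones vector
```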
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Stochastic Optimization under Distributional Drift [3.0229888038442922]
We provide non-asymptotic convergence guarantees for algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability.
We identify a low drift-to-noise regime in which the tracking efficiency of the gradient method benefits significantly from a step decay schedule.
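A step decay schedule holds the step size constant within a phase and cuts it geometrically between phases, rather than decaying every iteration. The sketch below tracks a slowly drifting minimizer under such a schedule; the drift model, phase length, and decay factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def track_with_step_decay(n_phases=8, phase_len=500,
                          eta0=0.5, decay=0.5, drift=1e-3, noise=0.5):
    """SGD on 0.5 * (theta - m)^2 where the minimizer m drifts slowly;
    the step size is constant within a phase and halved between phases."""
    theta, m, eta = 0.0, 0.0, eta0
    for _ in range(n_phases):
        for _ in range(phase_len):
            m += drift * rng.standard_normal()               # distributional drift
            g = (theta - m) + noise * rng.standard_normal()  # noisy gradient
            theta -= eta * g
        eta *= decay                                         # decay between phases
    return theta, m

theta, m = track_with_step_decay()
print("tracking error:", abs(theta - m))
```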
arXiv Detail & Related papers (2021-08-16T21:57:39Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
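Dropping the accept/reject step leaves a chain of unadjusted Langevin transitions between the annealed distributions, with the AIS weight accumulated along the way; since every operation is smooth, the estimator is differentiable in principle. A minimal one-chain sketch between two Gaussians follows; the endpoints, schedule, and step size are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

# Anneal from N(0, 1) to the unnormalized target exp(-(x - 3)^2 / 2).
log_p0 = lambda x: -0.5 * x**2
log_p1 = lambda x: -0.5 * (x - 3.0) ** 2
grad_log = lambda x, b: (1 - b) * (-x) + b * (3.0 - x)  # annealed score

def differentiable_ais(n_steps=100, eta=0.05):
    """AIS with unadjusted Langevin transitions: no Metropolis-Hastings
    correction, so the whole chain is differentiable in principle.
    Returns the final sample and its log importance weight."""
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    x = rng.standard_normal()       # exact draw from the initial N(0, 1)
    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * (log_p1(x) - log_p0(x))  # AIS weight increment
        x = x + eta * grad_log(x, b) + np.sqrt(2.0 * eta) * rng.standard_normal()
    return x, log_w

chains = [differentiable_ais() for _ in range(200)]
print("mean final sample:", np.mean([x for x, _ in chains]))  # near 3
```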
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under mild assumptions on the delays.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)