Efficiency Ordering of Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2209.07446v1
- Date: Thu, 15 Sep 2022 16:50:55 GMT
- Title: Efficiency Ordering of Stochastic Gradient Descent
- Authors: Jie Hu, Vishwaraj Doshi, Do Young Eun
- Abstract summary: We consider the stochastic gradient descent (SGD) algorithm driven by a general stochastic sequence, including i.i.d. noise and random walk on an arbitrary graph.
We employ the notion of `efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo (MCMC) samplers.
- Score: 9.634481296779057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the stochastic gradient descent (SGD) algorithm driven by a
general stochastic sequence, including i.i.d. noise and random walk on an
arbitrary graph, among others; and analyze it in the asymptotic sense.
Specifically, we employ the notion of `efficiency ordering', a well-analyzed
tool for comparing the performance of Markov Chain Monte Carlo (MCMC) samplers,
for SGD algorithms in the form of Loewner ordering of covariance matrices
associated with the scaled iterate errors in the long term. Using this
ordering, we show that input sequences that are more efficient for MCMC
sampling also lead to smaller covariance of the errors for SGD algorithms in
the limit. This also suggests that an arbitrarily weighted MSE of SGD iterates
in the limit becomes smaller when driven by more efficient chains. Our finding
is of particular interest in applications such as decentralized optimization
and swarm learning, where SGD is implemented in a random walk fashion on the
underlying communication graph for cost issues and/or data privacy. We
demonstrate how certain non-Markovian processes, for which typical mixing-time
based non-asymptotic bounds are intractable, can outperform their Markovian
counterparts in the sense of efficiency ordering for SGD. We show the utility
of our method by applying it to gradient descent with shuffling and mini-batch
gradient descent, reaffirming key results from existing literature under a
unified framework. Empirically, we also observe efficiency ordering for
variants of SGD such as accelerated SGD and Adam, opening up the possibility of
extending our notion of efficiency ordering to a broader family of stochastic
optimization algorithms.
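As a rough numerical illustration of the comparison described in the abstract (a sketch under assumed toy settings, not the paper's experiments), the snippet below runs SGD on a small quadratic objective driven by (a) i.i.d. uniform sampling and (b) a lazy random walk on a ring graph, then checks the Loewner ordering of the empirical covariances of the CLT-scaled errors. The objective, graph, and all constants are illustrative assumptions.

```python
# Sketch (toy setting, not the paper's experiments): SGD on a quadratic driven by
# (a) i.i.d. uniform sampling and (b) a lazy random walk on a ring graph, comparing the
# empirical covariances of the CLT-scaled errors in the Loewner order.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 2                           # number of nodes / dimension
B = rng.normal(size=(N, d))            # node i holds f_i(x) = 0.5*||x - B[i]||^2
x_star = B.mean(axis=0)                # minimizer of the average objective

def run_sgd(sampler, T=5000, a=1.0):
    """SGD with step size a/(k+1); `sampler(k)` returns the node index used at step k."""
    x = np.zeros(d)
    for k in range(T):
        x = x - a / (k + 1) * (x - B[sampler(k)])     # grad f_i(x) = x - B[i]
    return np.sqrt(T) * (x - x_star)                  # CLT-scaled error

def iid_sampler():
    return lambda k: rng.integers(N)                  # i.i.d. uniform node sampling

def lazy_ring_walk():
    state = {"i": 0}
    def step(k):                                      # lazy simple random walk on a ring
        if rng.random() < 0.5:
            state["i"] = (state["i"] + rng.choice([-1, 1])) % N
        return state["i"]
    return step

def scaled_error_cov(make_sampler, runs=400):
    errs = np.array([run_sgd(make_sampler()) for _ in range(runs)])
    return np.cov(errs, rowvar=False)

S_iid = scaled_error_cov(iid_sampler)
S_rw = scaled_error_cov(lazy_ring_walk)
# Loewner-ordering check: S_rw - S_iid should be (approximately) positive semidefinite,
# i.e. the i.i.d. sequence is at least as efficient as the slowly mixing walk.
print("min eigenvalue of S_rw - S_iid:", np.linalg.eigvalsh(S_rw - S_iid).min())
print("weighted-MSE proxy (trace):", np.trace(S_iid), "vs", np.trace(S_rw))
```

The min-eigenvalue test is the empirical analogue of the Loewner comparison used in the abstract; a nonnegative value (up to sampling noise) indicates the i.i.d. driven SGD dominates the random-walk driven one in this toy setting.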
Related papers
- Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quantile clipping strategy for Stochastic Gradient Descent (SGD).
Quantiles of the gradient norm are used as the clipping threshold, making the iterates robust to outliers.
We propose an efficient implementation of the algorithm using Huber-style quantile estimates.
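A minimal sketch of the quantile-clipping idea summarized above; the rolling-window quantile estimator, the constants, and the toy objective are assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of quantile-based gradient clipping: the clipping threshold is a
# running quantile of recently observed gradient norms.
import numpy as np

def quantile_clipped_sgd(grad_fn, x0, steps=1000, lr=0.05, q=0.9, window=200, rng=None):
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    norms = []                                    # recent gradient norms
    for _ in range(steps):
        g = grad_fn(x, rng)
        n = np.linalg.norm(g)
        norms.append(n)
        tau = np.quantile(norms[-window:], q)     # threshold = q-quantile of recent norms
        if n > tau:
            g = g * (tau / n)                     # clip heavy-tailed / outlier gradients
        x = x - lr * g
    return x

# Toy usage: quadratic objective with heavy-tailed gradient noise.
def noisy_grad(x, rng):
    return x + rng.standard_t(df=2, size=x.shape)

print(quantile_clipped_sgd(noisy_grad, x0=np.array([5.0, -3.0])))
```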
arXiv Detail & Related papers (2023-09-29T15:24:48Z) - Optimization of Annealed Importance Sampling Hyperparameters [77.34726150561087]
Annealed Importance Sampling (AIS) is a popular algorithm used to estimate the intractable marginal likelihood of deep generative models.
We present a parametric AIS process with flexible intermediary distributions and optimize the bridging distributions to use fewer sampling steps.
We assess the performance of our optimized AIS for marginal likelihood estimation of deep generative models and compare it to other estimators.
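For context, here is a bare-bones AIS estimator on a 1-D toy target with a fixed geometric bridging schedule and random-walk Metropolis transitions; the paper's contribution (optimizing the bridging distributions) is not implemented, and the toy target and constants are assumptions.

```python
# Bare-bones AIS for a 1-D toy: fixed geometric bridging schedule, random-walk
# Metropolis transitions. The optimized bridging distributions from the paper are
# intentionally not implemented here.
import numpy as np

rng = np.random.default_rng(0)

def log_prior(x):                 # normalized N(0, 4) base distribution
    return -x**2 / 8.0 - 0.5 * np.log(2 * np.pi * 4)

def log_target_unnorm(x):         # unnormalized target: 7*exp(-0.5*(x-3)^2), integral 7*sqrt(2*pi)
    return -0.5 * (x - 3.0)**2 + np.log(7.0)

def ais_log_Z(n_particles=2000, n_steps=50, step=1.0):
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    x = 2.0 * rng.standard_normal(n_particles)        # exact samples from the prior
    log_w = np.zeros(n_particles)
    for b0, b1 in zip(betas[:-1], betas[1:]):
        # importance-weight increment when moving from temperature b0 to b1
        log_w += (b1 - b0) * (log_target_unnorm(x) - log_prior(x))
        # one random-walk Metropolis step targeting the bridged density at b1
        prop = x + step * rng.standard_normal(n_particles)
        log_acc = ((1 - b1) * (log_prior(prop) - log_prior(x))
                   + b1 * (log_target_unnorm(prop) - log_target_unnorm(x)))
        accept = np.log(rng.random(n_particles)) < log_acc
        x = np.where(accept, prop, x)
    m = log_w.max()                                    # log-mean-exp of the weights
    return m + np.log(np.mean(np.exp(log_w - m)))

# True log normalizing constant: log(7 * sqrt(2*pi)), which is about 2.865
print("AIS log Z estimate:", ais_log_Z())
```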
arXiv Detail & Related papers (2022-09-27T07:58:25Z) - Non Asymptotic Bounds for Optimization via Online Multiplicative
Stochastic Gradient Descent [0.0]
The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties.
We show that noise classes that have the same mean and covariance structure as the minibatching noise of SGD have similar properties.
We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
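A hedged sketch of the comparison implied above: minibatch SGD versus gradient descent perturbed by Gaussian noise whose mean and covariance match the minibatch gradient noise. The least-squares toy problem and constants are assumptions, not the paper's setup.

```python
# Illustrative sketch: compare minibatch SGD with gradient descent perturbed by
# Gaussian noise whose mean and covariance match the minibatch gradient noise.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 200, 3, 10
A = rng.normal(size=(n, d))
y = A @ np.ones(d) + 0.1 * rng.normal(size=n)      # least-squares toy problem

def per_sample_grads(x):
    return A * (A @ x - y)[:, None]                # row i = gradient of 0.5*(a_i.x - y_i)^2

def minibatch_step(x, lr):
    idx = rng.choice(n, size=batch, replace=True)
    return x - lr * per_sample_grads(x)[idx].mean(axis=0)

def matched_noise_step(x, lr):
    G = per_sample_grads(x)
    cov = np.cov(G, rowvar=False) / batch          # covariance of a size-`batch` mean
    noise = rng.multivariate_normal(np.zeros(d), cov)
    return x - lr * (G.mean(axis=0) + noise)       # same mean, same covariance

x1 = np.zeros(d)
x2 = np.zeros(d)
for _ in range(2000):
    x1, x2 = minibatch_step(x1, 0.01), matched_noise_step(x2, 0.01)
print("minibatch SGD error   :", np.linalg.norm(x1 - np.ones(d)))
print("matched-noise GD error:", np.linalg.norm(x2 - np.ones(d)))
```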
arXiv Detail & Related papers (2021-12-14T02:25:43Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under suitable assumptions on the delays.
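A toy sketch of SGD with stale (delayed) gradients, the basic ingredient of DASGD; the bounded random-delay model and the constants are illustrative assumptions.

```python
# Illustrative sketch of SGD with stale (delayed) gradients, mimicking asynchronous workers.
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
d = 5
x_star = rng.normal(size=d)
grad = lambda z: z - x_star                        # gradient of 0.5*||z - x_star||^2

x = np.zeros(d)
history = deque([x.copy()], maxlen=50)             # past iterates visible to slow workers
for k in range(5000):
    delay = rng.integers(0, min(len(history), 20)) # random bounded staleness
    stale_x = history[-1 - delay]
    g = grad(stale_x) + 0.1 * rng.normal(size=d)   # stochastic gradient at a stale iterate
    x = x - (1.0 / (k + 10)) * g                   # vanishing step size
    history.append(x.copy())
print("distance to optimum:", np.linalg.norm(x - x_star))
```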
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Stochastic Reweighted Gradient Descent [4.355567556995855]
We propose an importance-sampling-based algorithm we call SRG (stochastic reweighted gradient).
We pay particular attention to the time and memory overhead of our proposed method.
We present empirical results to support our findings.
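A minimal sketch of importance-sampled, reweighted SGD in the spirit of SRG; the specific sampling probabilities (proportional to current per-example gradient norms, refreshed periodically) are an assumption, not SRG's rule.

```python
# Illustrative sketch of importance-sampled, reweighted SGD: draw example i with
# probability p_i and scale its gradient by 1/(n * p_i) so the estimate stays unbiased.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
A = rng.normal(size=(n, d)) * rng.uniform(0.1, 3.0, size=(n, 1))   # heterogeneous rows
y = A @ np.ones(d)

def grad_i(x, i):
    return A[i] * (A[i] @ x - y[i])

x, p = np.zeros(d), np.full(n, 1.0 / n)
for k in range(4000):
    if k % 200 == 0:                               # periodically refresh the probabilities
        norms = np.linalg.norm(A * (A @ x - y)[:, None], axis=1) + 1e-12
        p = norms / norms.sum()
    i = rng.choice(n, p=p)
    g = grad_i(x, i) / (n * p[i])                  # unbiased estimate of the average gradient
    x = x - 0.5 / (k + 50) * g
print("distance to the generating parameter:", np.linalg.norm(x - np.ones(d)))
```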
arXiv Detail & Related papers (2021-03-23T04:09:43Z) - An adaptive Hessian approximated stochastic gradient MCMC method [12.93317525451798]
We present an adaptive Hessian-approximated stochastic gradient MCMC method to incorporate local geometric information while sampling from the posterior.
We adopt a magnitude-based weight pruning method to enforce the sparsity of the network.
arXiv Detail & Related papers (2020-10-03T16:22:15Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster, but only up to an error neighborhood.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
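A small sketch of the trade-off described above: keeping the step size constant while the minibatch grows; the doubling schedule is an assumed stand-in for the paper's adaptive rule.

```python
# Illustrative sketch: hold the step size constant and let the minibatch grow,
# trading step-size attenuation for variance reduction.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
A = rng.normal(size=(n, d))
y = A @ np.ones(d) + 0.5 * rng.normal(size=n)

x, lr, batch = np.zeros(d), 0.1, 4
for k in range(1, 1001):
    idx = rng.choice(n, size=min(batch, n), replace=False)
    g = A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)
    x = x - lr * g                                 # constant step size throughout
    if k % 100 == 0:
        batch *= 2                                 # grow the batch instead of shrinking the step
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
print("distance to the least-squares solution:", np.linalg.norm(x - x_ls))
```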
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Least Squares Regression with Markovian Data: Fundamental Limits and
Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
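A toy sketch contrasting streaming SGD on Markovian (AR(1)) covariates with the same update fed from a replay buffer; the chain, buffer size, and constants are assumptions meant only to illustrate the experience-replay idea summarized above.

```python
# Illustrative sketch: least-squares SGD with covariates from a slowly mixing Markov
# chain, with and without a replay buffer that de-correlates the updates.
import numpy as np

rng = np.random.default_rng(0)
d, rho = 3, 0.99
w_star = np.array([1.0, -2.0, 0.5])

def next_state(a):
    # AR(1) Markov chain of covariates with stationary distribution N(0, I)
    return rho * a + np.sqrt(1 - rho**2) * rng.normal(size=d)

def run(use_replay, T=20000, buffer_size=2000, lr=0.01):
    w, a, buf = np.zeros(d), rng.normal(size=d), []
    for _ in range(T):
        a = next_state(a)
        buf.append((a, a @ w_star + 0.1 * rng.normal()))
        if len(buf) > buffer_size:
            buf.pop(0)
        a_t, y_t = buf[rng.integers(len(buf))] if use_replay else buf[-1]
        w = w - lr * (a_t @ w - y_t) * a_t            # SGD step on 0.5*(a.w - y)^2
    return np.linalg.norm(w - w_star)

print("error, streaming Markovian data:", run(use_replay=False))
print("error, with replay buffer      :", run(use_replay=True))
```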
arXiv Detail & Related papers (2020-06-16T04:26:50Z) - Improving Sampling Accuracy of Stochastic Gradient MCMC Methods via
Non-uniform Subsampling of Gradients [54.90670513852325]
We propose a non-uniform subsampling scheme to improve the sampling accuracy.
EWSG is designed so that a non-uniform gradient-MCMC method mimics the statistical behavior of a batch-gradient-MCMC method.
In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index.
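A mechanism-only sketch of the index selection described above: a short Metropolis-Hastings chain over data indices inside a stochastic-gradient Langevin update. The index weights (proportional to per-example gradient norms) and the exact-reweighting full pass below are simplifying assumptions, not EWSG's actual construction.

```python
# Mechanism-only sketch: SGLD where the data index is chosen by a short
# Metropolis-Hastings chain over indices instead of uniformly at random.
# The weights w_i proportional to ||grad_i(theta)|| are an assumption, not EWSG's weights.
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 2
data = rng.normal(loc=[1.0, -1.0], scale=1.0, size=(n, d))   # Gaussian-mean toy model

def grad_i(theta, i):
    # per-datum gradient of the negative log-likelihood, unit-variance Gaussian model
    return theta - data[i]

theta, idx = np.zeros(d), 0
eps = 1e-3                                                    # Langevin step size
for _ in range(3000):
    # MH chain over data indices targeting p_i proportional to w_i
    for _ in range(3):
        prop = rng.integers(n)
        ratio = (np.linalg.norm(grad_i(theta, prop)) + 1e-12) / \
                (np.linalg.norm(grad_i(theta, idx)) + 1e-12)
        if rng.random() < min(1.0, ratio):
            idx = prop
    # Reweighted Langevin update. Computing p exactly takes a full pass here, which keeps
    # the toy estimator unbiased; EWSG's point is to mimic the effect of the non-uniform
    # weighting without such a pass.
    w = np.linalg.norm(theta - data, axis=1) + 1e-12
    p = w / w.sum()
    g = grad_i(theta, idx) / p[idx]                           # estimates the summed gradient
    theta = theta - 0.5 * eps * g + np.sqrt(eps) * rng.normal(size=d)
print("Langevin iterate:", theta, " data mean:", data.mean(axis=0))
```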
arXiv Detail & Related papers (2020-02-20T18:56:18Z) - Stochastic Approximate Gradient Descent via the Langevin Algorithm [11.36635610546803]
We introduce stochastic approximate gradient descent (SAGD) as an alternative to stochastic gradient descent for cases where unbiased stochastic gradients cannot be trivially obtained.
We show that SAGD performs well experimentally on popular statistical and machine learning problems such as the expectation-maximization algorithm and variational autoencoders.
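A toy sketch of the SAGD idea: when the gradient is an expectation over a latent variable, approximate it with a few unadjusted Langevin steps. The Gaussian latent-variable model and all constants below are assumptions, not the paper's examples.

```python
# Illustrative toy: the gradient of the marginal likelihood is an expectation over a
# latent z, approximated by a few unadjusted Langevin steps before each ascent step.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
y = theta_true + np.sqrt(2.0) * rng.normal(size=500)   # model: z ~ N(theta,1), y|z ~ N(z,1)

def langevin_z(theta, y, n_steps=20, eps=0.1):
    """Approximate samples from p(z | y, theta) via unadjusted Langevin dynamics."""
    z = y.copy()
    for _ in range(n_steps):
        grad_log_post = -(z - theta) - (z - y)          # d/dz [log p(z|theta) + log p(y|z)]
        z = z + 0.5 * eps * grad_log_post + np.sqrt(eps) * rng.normal(size=z.shape)
    return z

theta = 0.0
for _ in range(200):
    z = langevin_z(theta, y)
    grad = np.mean(z - theta)          # Fisher identity: E_{z|y}[d/dtheta log p(y, z; theta)]
    theta = theta + 0.5 * grad         # ascent using the approximate gradient
print("estimated theta:", theta, "(true:", theta_true, ")")
```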
arXiv Detail & Related papers (2020-02-13T14:29:21Z)