Accelerating Distributed Stochastic Optimization via Self-Repellent
Random Walks
- URL: http://arxiv.org/abs/2401.09665v1
- Date: Thu, 18 Jan 2024 00:50:37 GMT
- Title: Accelerating Distributed Stochastic Optimization via Self-Repellent
Random Walks
- Authors: Jie Hu and Vishwaraj Doshi and Do Young Eun
- Abstract summary: We study a family of distributed optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion.
We take a novel approach by replacing the standard linear Markovian token by one which follows a nonlinear Markov chain - namely the Self-Repellent Radom Walk (SRRW)
We prove that the optimization errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem.
- Score: 11.3631620309434
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study a family of distributed stochastic optimization algorithms where
gradients are sampled by a token traversing a network of agents in random-walk
fashion. Typically, these random-walks are chosen to be Markov chains that
asymptotically sample from a desired target distribution, and play a critical
role in the convergence of the optimization iterates. In this paper, we take a
novel approach by replacing the standard linear Markovian token by one which
follows a nonlinear Markov chain - namely the Self-Repellent Radom Walk (SRRW).
Defined for any given 'base' Markov chain, the SRRW, parameterized by a
positive scalar {\alpha}, is less likely to transition to states that were
highly visited in the past, thus the name. In the context of MCMC sampling on a
graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW
achieves O(1/{\alpha}) decrease in the asymptotic variance for sampling. We
propose the use of a 'generalized' version of the SRRW to drive token
algorithms for distributed stochastic optimization in the form of stochastic
approximation, termed SA-SRRW. We prove that the optimization iterate errors of
the resulting SA-SRRW converge to zero almost surely and prove a central limit
theorem, deriving the explicit form of the resulting asymptotic covariance
matrix corresponding to iterate errors. This asymptotic covariance is always
smaller than that of an algorithm driven by the base Markov chain and decreases
at rate O(1/{\alpha}^2) - the performance benefit of using SRRW thereby
amplified in the stochastic optimization context. Empirical results support our
theoretical findings.
Related papers
- Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quant clipping strategy for Gradient Descent (SGD)
We use gradient new outliers as norm clipping chains.
We propose an implementation of the algorithm using Huberiles.
arXiv Detail & Related papers (2023-09-29T15:24:48Z) - Adaptive Annealed Importance Sampling with Constant Rate Progress [68.8204255655161]
Annealed Importance Sampling (AIS) synthesizes weighted samples from an intractable distribution.
We propose the Constant Rate AIS algorithm and its efficient implementation for $alpha$-divergences.
arXiv Detail & Related papers (2023-06-27T08:15:28Z) - Self-Repellent Random Walks on General Graphs -- Achieving Minimal
Sampling Variance via Nonlinear Markov Chains [11.3631620309434]
We consider random walks on discrete state spaces, such as general undirected graphs, where the random walkers are designed to approximate a target quantity over the network topology via sampling and neighborhood exploration.
Given any Markov chain corresponding to a target probability distribution, we design a self-repellent random walk (SRRW) which is less likely to transition to nodes that were highly visited in the past, and more likely to transition to seldom visited nodes.
For a class of SRRWs parameterized by a positive real alpha, we prove that the empirical distribution of the process converges almost surely to the the target (
arXiv Detail & Related papers (2023-05-08T23:59:09Z) - Stochastic Gradient Descent under Markovian Sampling Schemes [3.04585143845864]
We study a variation of vanilla gradient descent where the only has access to a Markovian sampling scheme.
We focus on obtaining rates of convergence under the least restrictive assumptions possible on the underlying Markov chain.
arXiv Detail & Related papers (2023-02-28T09:18:00Z) - Optimization of Annealed Importance Sampling Hyperparameters [77.34726150561087]
Annealed Importance Sampling (AIS) is a popular algorithm used to estimates the intractable marginal likelihood of deep generative models.
We present a parameteric AIS process with flexible intermediary distributions and optimize the bridging distributions to use fewer number of steps for sampling.
We assess the performance of our optimized AIS for marginal likelihood estimation of deep generative models and compare it to other estimators.
arXiv Detail & Related papers (2022-09-27T07:58:25Z) - Coefficient-based Regularized Distribution Regression [4.21768682940933]
We consider the coefficient-based regularized distribution regression which aims to regress from probability measures to real-valued responses over a kernel reproducing Hilbert space (RKHS)
Asymptotic behaviors of the algorithm in different regularity ranges of the regression function are comprehensively studied.
We get the optimal rates under some mild conditions, which matches the one-stage sampled minimax optimal rate.
arXiv Detail & Related papers (2022-08-26T03:46:14Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the emphStochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - A Stochastic Newton Algorithm for Distributed Convex Optimization [62.20732134991661]
We analyze a Newton algorithm for homogeneous distributed convex optimization, where each machine can calculate gradients of the same population objective.
We show that our method can reduce the number, and frequency, of required communication rounds compared to existing methods without hurting performance.
arXiv Detail & Related papers (2021-10-07T17:51:10Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Spectral clustering under degree heterogeneity: a case for the random
walk Laplacian [83.79286663107845]
This paper shows that graph spectral embedding using the random walk Laplacian produces vector representations which are completely corrected for node degree.
In the special case of a degree-corrected block model, the embedding concentrates about K distinct points, representing communities.
arXiv Detail & Related papers (2021-05-03T16:36:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.