Accelerating Optimization and Reinforcement Learning with
Quasi-Stochastic Approximation
- URL: http://arxiv.org/abs/2009.14431v2
- Date: Thu, 1 Oct 2020 04:39:19 GMT
- Title: Accelerating Optimization and Reinforcement Learning with
Quasi-Stochastic Approximation
- Authors: Shuhang Chen, Adithya Devraj, Andrey Bernstein and Sean Meyn
- Abstract summary: This paper sets out to extend convergence theory to quasi-stochastic approximation.
It is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
- Score: 2.294014185517203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ODE method has been a workhorse for algorithm design and analysis since
the introduction of stochastic approximation. It is now understood that
convergence theory amounts to establishing robustness of Euler approximations
for ODEs, while the theory of convergence rates requires finer analysis. This
paper sets out to extend this theory to quasi-stochastic approximation, in
which the "noise" driving the algorithm is generated by deterministic signals.
The main results are obtained under minimal assumptions: the usual Lipschitz
conditions for the ODE vector field, together with the existence of a
well-defined linearization near the optimal parameter $\theta^*$, with Hurwitz
linearization matrix $A^*$.
The main contributions are summarized as follows:
(i) If the algorithm gain is $a_t=g/(1+t)^\rho$ with $g>0$ and
$\rho\in(0,1)$, then the rate of convergence of the algorithm is $1/t^\rho$.
There is also a well-defined "finite-$t$" approximation: \[
a_t^{-1}\{\Theta_t-\theta^*\}=\bar{Y}+\Xi^{\mathrm{I}}_t+o(1) \] where
$\bar{Y}\in\mathbb{R}^d$ is a vector identified in the paper, and
$\{\Xi^{\mathrm{I}}_t\}$ is bounded with zero temporal mean.
(ii) With gain $a_t = g/(1+t)$ the results are not as sharp: the rate of
convergence $1/t$ holds only if $I + g A^*$ is Hurwitz.
(iii) Based on the Ruppert-Polyak averaging of stochastic approximation, one
would expect that a convergence rate of $1/t$ can be obtained by averaging: \[
\Theta^{\text{RP}}_T=\frac{1}{T}\int_{0}^T \Theta_t\,dt \] where the estimates
$\{\Theta_t\}$ are obtained using the gain in (i). The preceding sharp bounds
imply that averaging results in a $1/t$ convergence rate if and only if
$\bar{Y}=\mathsf{0}$. This condition holds when the noise is additive, but appears to
fail in general.
(iv) The theory is illustrated with applications to gradient-free
optimization and policy gradient algorithms for reinforcement learning.
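As a rough illustration of items (i)-(iv), the following Python sketch runs a quasi-stochastic-approximation style gradient-free minimization: the exploration "noise" is a deterministic sinusoidal probing signal, the gain is $a_t=g/(1+t)^\rho$, and a running Ruppert-Polyak average is maintained alongside the raw estimate. The objective, probing frequencies, and all constants are illustrative assumptions, not the algorithms or settings studied in the paper.

```python
import numpy as np

# Quasi-stochastic approximation (QSA) sketch for gradient-free minimization
# of a smooth objective J.  The exploration "noise" is a deterministic
# sinusoidal probing signal, the gain is a_t = g/(1+t)^rho, and a running
# Ruppert-Polyak average is kept alongside the raw estimate.  The objective,
# probing frequencies and all constants are illustrative, not from the paper.

def J(theta):
    # strongly convex test objective with minimizer theta* = (1, -2)
    return 0.5 * (theta[0] - 1.0) ** 2 + 0.5 * (theta[1] + 2.0) ** 2

def qsa_gradient_free(n_steps=100_000, dt=1e-2, g=2.0, rho=0.8, eps=0.1):
    freqs = np.array([1.0, np.sqrt(2.0)])      # incommensurate probing frequencies
    theta = np.zeros(2)                        # raw QSA estimate Theta_t
    theta_rp = np.zeros(2)                     # running average (1/T) int_0^T Theta_t dt
    t = 0.0
    for n in range(n_steps):
        t += dt
        a_t = g / (1.0 + t) ** rho             # vanishing gain a_t = g/(1+t)^rho
        xi = np.sqrt(2.0) * np.cos(2.0 * np.pi * freqs * t)   # deterministic probe
        # two-point gradient estimate along the probing direction
        grad_est = xi * (J(theta + eps * xi) - J(theta - eps * xi)) / (2.0 * eps)
        theta = theta - dt * a_t * grad_est    # Euler step of the QSA ODE
        theta_rp += (theta - theta_rp) / (n + 1.0)
    return theta, theta_rp

if __name__ == "__main__":
    theta_hat, theta_avg = qsa_gradient_free()
    print("raw QSA estimate       :", theta_hat)
    print("Ruppert-Polyak average :", theta_avg)
```

With $\rho\in(0,1)$ the raw estimate tracks the $1/t^\rho$ behaviour described in (i); per (iii), whether the averaged iterate improves this to a $1/t$ rate depends on whether the vector $\bar{Y}$ vanishes, which cannot be taken for granted when the probing signal enters multiplicatively, as it does here.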
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Revisiting Step-Size Assumptions in Stochastic Approximation [1.3654846342364308]
The paper revisits step-size selection in a general Markovian setting.
A major conclusion is that the choice of $\rho = 0$ or even $\rho < 1/2$ is justified only in select settings.
arXiv Detail & Related papers (2024-05-28T05:11:05Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^T \mathbb{E}\big[\|\nabla f(x^k)\|_1\big] \le O\big(\tfrac{\sqrt{d}}{T^{1/4}}\big)$ measured by the $\ell_1$ norm.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^T \mathbb{E}\big[\|\nabla f(x^k)\|_2\big] \le O\big(\tfrac{1}{T^{1/4}}\big)$ rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - Convergence of a Normal Map-based Prox-SGD Method under the KL
Inequality [0.0]
We present a novel normal map-based algorithm ($\mathsf{norM}\text{-}\mathsf{SGD}$) and analyze its convergence under the Kurdyka-Łojasiewicz (KL) inequality.
arXiv Detail & Related papers (2023-05-10T01:12:11Z) - Breaking the Lower Bound with (Little) Structure: Acceleration in
Non-Convex Stochastic Optimization with Heavy-Tailed Noise [28.780192812703948]
We consider the optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime.
We show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure.
We also establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)\, T^{\frac{1-p}{3p-2}})$ under a mild condition.
arXiv Detail & Related papers (2023-02-14T00:23:42Z) - Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD [38.221784575853796]
This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle.
We develop a technique that allows us to prove $\mathcal{O}\big(\tfrac{\mathrm{polylog}(T)\,\sigma}{\sqrt{T}}\big)$ convergence rates without assuming uniform bounds on the noise.
arXiv Detail & Related papers (2023-02-13T18:13:36Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\mathrm{OPT} + \epsilon$ with high probability.
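For intuition only, here is a minimal gradient-descent sketch for fitting a single neuron $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ with a ReLU activation: the Gaussian data model, additive (non-adversarial) label noise, and step size are assumptions made for illustration and are not the setting analyzed in that paper.

```python
import numpy as np

# Minimal sketch: fit a single neuron x -> sigma(w . x), sigma = ReLU, by
# gradient descent on the square loss under additive label noise.  The data
# distribution, noise model and hyper-parameters are illustrative only.

rng = np.random.default_rng(0)
d, n = 5, 5000
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

X = rng.normal(size=(n, d))                 # Gaussian covariates
y = np.maximum(X @ w_star, 0.0)             # clean labels sigma(w* . x)
y = y + 0.1 * rng.normal(size=n)            # additive label noise

w = 0.1 * rng.normal(size=d)                # initialization
lr = 0.5
for _ in range(500):
    pred = np.maximum(X @ w, 0.0)
    # (sub)gradient of the empirical loss F(w) = (1/2n) sum (sigma(w.x_i) - y_i)^2
    grad = X.T @ ((pred - y) * (X @ w > 0)) / n
    w -= lr * grad

print("parameter error ||w - w*||:", np.linalg.norm(w - w_star))
print("final mean square loss    :", 0.5 * np.mean((np.maximum(X @ w, 0.0) - y) ** 2))
```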
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning [3.8098187557917464]
The paper concerns the $d$-dimensional stochastic approximation recursion, $$\theta_{n+1} = \theta_n + \alpha_{n+1}\, f(\theta_n, \Phi_{n+1}).$$
The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3).
An example is given where $f$ and $\bar{f}$ are linear in $\theta$, and $\Phi$ is a geometrically ergodic Markov chain.
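A minimal simulation of that recursion, with $f$ linear in $\theta$ and $\Phi$ a two-state Markov chain, might look as follows; the chain, the matrices, and the step-size rule are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

# Sketch of the d-dimensional stochastic approximation recursion
#     theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, Phi_{n+1})
# with f(theta, phi) = A(phi) theta + b(phi) linear in theta and Phi a
# two-state Markov chain.  All numerical choices are illustrative only.

rng = np.random.default_rng(1)

P = np.array([[0.9, 0.1],                    # transition matrix of Phi
              [0.2, 0.8]])
A = {0: np.array([[-1.0, 0.2], [0.0, -0.5]]),
     1: np.array([[-0.8, 0.0], [0.1, -1.2]])}
b = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 1.0])}

theta = np.zeros(2)
phi = 0
for n in range(1, 100_001):
    phi = int(rng.choice(2, p=P[phi]))       # Phi_{n+1} ~ P(Phi_n, .)
    alpha = 1.0 / n                          # step size alpha_{n+1} = 1/n
    theta = theta + alpha * (A[phi] @ theta + b[phi])

# root of the averaged vector field bar f(theta) = bar A theta + bar b
pi = np.array([2.0 / 3.0, 1.0 / 3.0])        # stationary distribution of P
A_bar = pi[0] * A[0] + pi[1] * A[1]
b_bar = pi[0] * b[0] + pi[1] * b[1]
theta_star = np.linalg.solve(-A_bar, b_bar)
print("SA estimate        :", theta)
print("root of mean flow  :", theta_star)
```

The final iterate is compared against the root $\theta^*$ of the averaged vector field $\bar{f}(\theta)=\bar{A}\theta+\bar{b}$, computed from the chain's stationary distribution.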
arXiv Detail & Related papers (2021-10-27T13:38:25Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + \big(\log(d)/(\varepsilon n)\big)^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Sparse sketches with small inversion bias [79.77110958547695]
Inversion bias arises when averaging estimates of quantities that depend on the inverse covariance.
We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices.
We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, the estimator is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m = O(d + \sqrt{d}/\epsilon)$.
arXiv Detail & Related papers (2020-11-21T01:33:15Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)