Adaptive Temporal Difference Learning with Linear Function Approximation
- URL: http://arxiv.org/abs/2002.08537v2
- Date: Mon, 11 Oct 2021 03:30:04 GMT
- Title: Adaptive Temporal Difference Learning with Linear Function Approximation
- Authors: Tao Sun, Han Shen, Tianyi Chen, Dongsheng Li
- Abstract summary: This paper revisits the temporal difference (TD) learning algorithm for the policy evaluation tasks in reinforcement learning.
We develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation.
We evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks.
- Score: 29.741034258674205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper revisits the temporal difference (TD) learning algorithm for the
policy evaluation tasks in reinforcement learning. Typically, the performance
of TD(0) and TD($\lambda$) is very sensitive to the choice of stepsizes.
Oftentimes, TD(0) suffers from slow convergence. Motivated by the tight link
between the TD(0) learning algorithm and stochastic gradient methods, we
develop a provably convergent adaptive projected variant of the TD(0) learning
algorithm with linear function approximation that we term AdaTD(0). In contrast
to TD(0), AdaTD(0) is robust to, or at least less sensitive to, the choice of stepsizes.
Analytically, we establish that to reach an $\epsilon$ accuracy, the number of
iterations needed is
$\tilde{O}(\epsilon^{-2}\ln^4\frac{1}{\epsilon}/\ln^4\frac{1}{\rho})$ in the
general case, where $\rho$ characterizes the speed at which the underlying Markov
chain converges to the stationary distribution. This implies that the iteration
complexity of AdaTD(0) is no worse than that of TD(0) in the worst case. When
the stochastic semi-gradients are sparse, we provide theoretical acceleration
of AdaTD(0). Going beyond TD(0), we develop an adaptive variant of
TD($\lambda$), which is referred to as AdaTD($\lambda$). Empirically, we
evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard
reinforcement learning tasks, which demonstrate the effectiveness of our new
approaches.
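To make the adaptive projected TD(0) idea concrete, here is a minimal Python/NumPy sketch, assuming an AdaGrad-style per-coordinate stepsize and an $\ell_2$-ball projection of radius $R$; the exact AdaTD(0) update in the paper may differ in these details. The functions `env_step` and `phi` are hypothetical placeholders for the environment (under the evaluated policy) and the feature map.

```python
# Minimal sketch of an adaptive projected TD(0) update with linear function
# approximation, in the spirit of AdaTD(0). The AdaGrad-style accumulator and
# the l2-ball projection radius R are illustrative assumptions, not necessarily
# the paper's exact update rule. `env_step` and `phi` are hypothetical
# placeholders for the environment (under the evaluated policy) and feature map.
import numpy as np

def adaptive_projected_td0(env_step, phi, theta0, gamma=0.99, alpha=0.1,
                           R=10.0, eps=1e-8, num_steps=10_000):
    """env_step(s) -> (reward, next_state); phi(s) -> feature vector."""
    theta = theta0.astype(float).copy()
    accum = np.zeros_like(theta)              # running sum of squared semi-gradients
    s = 0                                     # assumed initial state id
    for _ in range(num_steps):
        r, s_next = env_step(s)
        # TD(0) error and stochastic semi-gradient under linear approximation
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        g = -delta * phi(s)
        accum += g ** 2
        theta = theta - alpha * g / (np.sqrt(accum) + eps)   # adaptive stepsize
        norm = np.linalg.norm(theta)
        if norm > R:                          # projection keeps iterates bounded
            theta *= R / norm
        s = s_next
    return theta
```

The per-coordinate scaling mirrors how adaptive stochastic gradient methods damp the stepsize automatically, which is the link to stochastic gradient methods that the abstract cites as motivation.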
Related papers
- Statistical Efficiency of Distributional Temporal Difference Learning [24.03281329962804]
We analyze the finite-sample performance of distributional temporal difference learning, including categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD).
For a $\gamma$-discounted infinite-horizon decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p}}\right)$ samples to achieve an $\varepsilon$-optimal estimator with high probability.
We establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest.
arXiv Detail & Related papers (2024-03-09T06:19:53Z)
- Distributed TD(0) with Almost No Communication [15.321579527891457]
We provide a new non-asymptotic analysis of distributed temporal difference learning with linear function approximation.
We demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0).
arXiv Detail & Related papers (2023-05-25T17:00:46Z)
- Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient in order of arrival (a minimal sketch of this idea appears after this list).
We then develop an improved algorithm, which reduces the dynamic regret bound achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
arXiv Detail & Related papers (2023-05-20T07:54:07Z)
- Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation [44.27439128304058]
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging (a sketch of tail-averaged TD(0) appears after this list).
We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a suitable step-size choice.
arXiv Detail & Related papers (2022-10-12T04:37:54Z)
- Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate for asynchronous SGD than previously known.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z)
- PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method [49.93717224277131]
We propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts and updates the follow-on trace only for a finite period.
We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the exponential sample complexity of ETD to a polynomial one (a rough sketch of the periodic-restart idea appears after this list).
arXiv Detail & Related papers (2021-10-13T17:40:12Z)
- Predictor-Corrector (PC) Temporal Difference (TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call the Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The parameters being approximated enjoy a guaranteed order-of-magnitude reduction in the Taylor series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z)
- A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) algorithm -- for tackling bilevel optimization problems.
MSTSA allows us to control the error in the iterates caused by the inexact solution of the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
- Asynchronous Advantage Actor Critic: Non-asymptotic Analysis and Linear Speedup [56.27526702716774]
This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0), with provable convergence guarantees.
Under i.i.d. sampling, A3C-TD(0) obtains a sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$ accuracy, where $N$ is the number of workers.
Compared to the best-known sample complexity for two-timescale counterparts, this corresponds to a linear speedup in the number of workers.
arXiv Detail & Related papers (2020-12-31T09:07:09Z)
- Reducing Sampling Error in Batch Temporal Difference Learning [42.30708351947417]
Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning.
This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data.
arXiv Detail & Related papers (2020-08-15T15:30:06Z)
- Reanalysis of Variance Reduced Temporal Difference Learning [57.150444843282]
A variance-reduced TD (VRTD) algorithm was proposed by Korda and La, which applies the variance reduction technique directly to online TD learning with Markovian samples (an SVRG-style sketch of this idea appears after this list).
We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate.
arXiv Detail & Related papers (2020-01-07T05:32:43Z)
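For the delayed-OCO entry above ("Non-stationary Online Convex Optimization with Arbitrary Delays"), the following is a minimal sketch of DOGD as summarized there: one projected gradient-descent step per delayed gradient, processed in arrival order. The unit-ball projection and the constant stepsize `eta` are illustrative assumptions, not the paper's exact choices.

```python
# Minimal sketch of the DOGD idea: one projected gradient-descent step per
# delayed gradient, processed in arrival order. The unit-ball projection and
# the constant stepsize are illustrative assumptions.
import numpy as np

def project(x, radius=1.0):
    n = np.linalg.norm(x)
    return x if n <= radius else x * radius / n

def dogd(arrivals, x0, eta=0.1):
    """arrivals[t] is the list of delayed gradients that arrive in round t."""
    x = x0.copy()
    decisions = []
    for grads_t in arrivals:
        decisions.append(x.copy())         # play the current point
        for g in grads_t:                  # process delayed feedback in arrival order
            x = project(x - eta * g)       # one descent step per delayed gradient
    return decisions
```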
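For the tail-averaging entry ("Finite time analysis of temporal difference learning..."), here is a sketch of tail-averaged TD(0) with linear function approximation: run the standard TD(0) recursion and report the average of only the last fraction of iterates. The constant step size, the tail fraction, and the placeholder `sample_transition` are assumptions for illustration.

```python
# Sketch of tail-averaged TD(0) with linear function approximation: run the
# standard TD(0) recursion and average only the last `tail_fraction` of the
# iterates. Step size and tail fraction are illustrative choices.
import numpy as np

def tail_averaged_td0(sample_transition, phi, theta0, alpha=0.05, gamma=0.99,
                      num_steps=10_000, tail_fraction=0.5):
    """sample_transition(s) -> (reward, next_state) under the evaluated policy."""
    theta, s = theta0.astype(float).copy(), 0
    iterates = []
    for _ in range(num_steps):
        r, s_next = sample_transition(s)
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * delta * phi(s)
        iterates.append(theta.copy())
        s = s_next
    start = int((1.0 - tail_fraction) * num_steps)
    return np.mean(iterates[start:], axis=0)   # average over the tail only
```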
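For the PER-ETD entry, a rough sketch of the periodic-restart idea applied to an emphatic-TD(0)-style update: the follow-on trace F is reset every `period` steps so that it only accumulates over a finite window. PER-ETD's actual restart and update schedule is more refined; `sample`, which also returns the importance ratio `rho`, is a hypothetical placeholder.

```python
# Rough sketch of the periodic-restart idea behind PER-ETD: the follow-on
# trace F of an emphatic-TD(0)-style update is reset every `period` steps so
# that it only accumulates over a finite window. Illustrative, not the exact
# PER-ETD schedule.
import numpy as np

def per_etd_sketch(sample, phi, theta0, alpha=0.05, gamma=0.99,
                   period=100, num_steps=10_000):
    """sample(s) -> (reward, next_state, rho) with rho the importance ratio."""
    theta, s, F = theta0.astype(float).copy(), 0, 1.0
    for t in range(num_steps):
        if t % period == 0:
            F = 1.0                          # periodic restart of the follow-on trace
        r, s_next, rho = sample(s)
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * F * rho * delta * phi(s)
        F = gamma * rho * F + 1.0            # follow-on trace recursion (interest = 1)
        s = s_next
    return theta
```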
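For the VRTD entry, an SVRG-style sketch of variance-reduced TD over a fixed batch of transitions: within each epoch, every stochastic semi-gradient is corrected by its value at a fixed reference iterate plus the full-batch semi-gradient at that reference. The actual VRTD algorithm of Korda and La works with online Markovian samples and may differ in details; the batch/epoch structure here is an assumption for illustration.

```python
# SVRG-style sketch of variance-reduced TD over a fixed batch of transitions.
# Each stochastic semi-gradient is corrected by its value at a fixed reference
# iterate plus the full-batch semi-gradient at that reference. Illustrative
# only; VRTD itself operates on online Markovian samples.
import numpy as np

def semi_grad(theta, s, r, s_next, phi, gamma):
    delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
    return -delta * phi(s)

def vrtd_sketch(batch, phi, dim, alpha=0.05, gamma=0.99, epochs=20, seed=0):
    """batch is a list of (s, r, s_next) transitions; phi(s) -> feature vector."""
    rng = np.random.default_rng(seed)
    theta_ref = np.zeros(dim)
    for _ in range(epochs):
        # full-batch semi-gradient at the reference point
        g_ref = np.mean([semi_grad(theta_ref, *tr, phi, gamma) for tr in batch], axis=0)
        theta = theta_ref.copy()
        for _ in range(len(batch)):
            s, r, s_next = batch[rng.integers(len(batch))]
            g = semi_grad(theta, s, r, s_next, phi, gamma)
            g_corr = g - semi_grad(theta_ref, s, r, s_next, phi, gamma) + g_ref
            theta = theta - alpha * g_corr   # variance-reduced update
        theta_ref = theta
    return theta_ref
```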