Adaptive Temporal Difference Learning with Linear Function Approximation
- URL: http://arxiv.org/abs/2002.08537v2
- Date: Mon, 11 Oct 2021 03:30:04 GMT
- Title: Adaptive Temporal Difference Learning with Linear Function Approximation
- Authors: Tao Sun, Han Shen, Tianyi Chen, Dongsheng Li
- Abstract summary: This paper revisits the temporal difference (TD) learning algorithm for the policy evaluation tasks in reinforcement learning.
We develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation.
We evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks.
- Score: 29.741034258674205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper revisits the temporal difference (TD) learning algorithm for the
policy evaluation tasks in reinforcement learning. Typically, the performance
of TD(0) and TD($\lambda$) is very sensitive to the choice of stepsizes.
Oftentimes, TD(0) suffers from slow convergence. Motivated by the tight link
between the TD(0) learning algorithm and stochastic gradient methods, we
develop a provably convergent adaptive projected variant of the TD(0) learning
algorithm with linear function approximation that we term AdaTD(0). In contrast
to TD(0), AdaTD(0) is robust to, or at least less sensitive to, the choice of stepsizes.
Analytically, we establish that to reach an $\epsilon$ accuracy, the number of
iterations needed is
$\tilde{O}(\epsilon^{-2}\ln^4\frac{1}{\epsilon}/\ln^4\frac{1}{\rho})$ in the
general case, where $\rho$ characterizes the speed at which the underlying Markov
chain converges to the stationary distribution. This implies that the iteration
complexity of AdaTD(0) is no worse than that of TD(0) in the worst case. When
the stochastic semi-gradients are sparse, we provide theoretical acceleration
of AdaTD(0). Going beyond TD(0), we develop an adaptive variant of
TD($\lambda$), which is referred to as AdaTD($\lambda$). Empirically, we
evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard
reinforcement learning tasks, which demonstrate the effectiveness of our new
approaches.
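To make the adaptive projected TD(0) idea concrete, here is a minimal Python/NumPy sketch, assuming an AdaGrad-style per-coordinate stepsize and an $\ell_2$-ball projection of radius $R$; the exact AdaTD(0) update in the paper may differ in these details. The functions `env_step` and `phi` are hypothetical placeholders for the environment (under the evaluated policy) and the feature map.

```python
# Minimal sketch of an adaptive projected TD(0) update with linear function
# approximation, in the spirit of AdaTD(0). The AdaGrad-style accumulator and
# the l2-ball projection radius R are illustrative assumptions, not necessarily
# the paper's exact update rule. `env_step` and `phi` are hypothetical
# placeholders for the environment (under the evaluated policy) and feature map.
import numpy as np

def adaptive_projected_td0(env_step, phi, theta0, gamma=0.99, alpha=0.1,
                           R=10.0, eps=1e-8, num_steps=10_000):
    """env_step(s) -> (reward, next_state); phi(s) -> feature vector."""
    theta = theta0.astype(float).copy()
    accum = np.zeros_like(theta)              # running sum of squared semi-gradients
    s = 0                                     # assumed initial state id
    for _ in range(num_steps):
        r, s_next = env_step(s)
        # TD(0) error and stochastic semi-gradient under linear approximation
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        g = -delta * phi(s)
        accum += g ** 2
        theta = theta - alpha * g / (np.sqrt(accum) + eps)   # adaptive stepsize
        norm = np.linalg.norm(theta)
        if norm > R:                          # projection keeps iterates bounded
            theta *= R / norm
        s = s_next
    return theta
```

The per-coordinate scaling mirrors how adaptive stochastic gradient methods damp the stepsize automatically, which is the link to stochastic gradient methods that the abstract cites as motivation.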
Related papers
- Statistical Efficiency of Distributional Temporal Difference Learning [24.03281329962804]
We analyze the finite-sample performance of distributional temporal difference learning, including categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD).
For a $\gamma$-discounted infinite-horizon decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p}}\right)$ samples to achieve an $\varepsilon$-optimal estimator with high probability.
We establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest.
arXiv Detail & Related papers (2024-03-09T06:19:53Z)
- Distributed TD(0) with Almost No Communication [15.321579527891457]
We provide a new non-asymptotic analysis of distributed temporal difference learning with linear function approximation.
We demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0).
arXiv Detail & Related papers (2023-05-25T17:00:46Z)
- Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient in order of arrival (a minimal sketch of this idea appears after this list).
We then develop an improved algorithm, which reduces the dynamic regret bound achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
arXiv Detail & Related papers (2023-05-20T07:54:07Z)
- Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation [44.27439128304058]
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging (a sketch of tail-averaged TD(0) appears after this list).
We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a suitable step-size choice.
arXiv Detail & Related papers (2022-10-12T04:37:54Z)
- Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate for asynchronous SGD than previously known.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z)
- PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method [49.93717224277131]
We propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts and updates the follow-on trace only for a finite period.
We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the exponential sample complexity of ETD to a polynomial one (a rough sketch of the periodic-restart idea appears after this list).
arXiv Detail & Related papers (2021-10-13T17:40:12Z)
- Predictor-Corrector (PC) Temporal Difference (TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call the Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The parameters being approximated enjoy a guaranteed order-of-magnitude reduction in the Taylor series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z)
- A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) algorithm -- for tackling bilevel optimization problems.
MSTSA allows us to control the error in the iterates caused by the inexact solution of the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
- Asynchronous Advantage Actor Critic: Non-asymptotic Analysis and Linear Speedup [56.27526702716774]
This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0), with provable convergence guarantees.
Under i.i.d. sampling, A3C-TD(0) obtains a sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$ accuracy, where $N$ is the number of workers.
Compared to the best-known sample complexity for two-timescale counterparts, this corresponds to a linear speedup in the number of workers.
arXiv Detail & Related papers (2020-12-31T09:07:09Z)
- Reducing Sampling Error in Batch Temporal Difference Learning [42.30708351947417]
Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning.
This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data.
arXiv Detail & Related papers (2020-08-15T15:30:06Z)
- Reanalysis of Variance Reduced Temporal Difference Learning [57.150444843282]
A variance-reduced TD (VRTD) algorithm was proposed by Korda and La, which applies the variance reduction technique directly to online TD learning with Markovian samples (an SVRG-style sketch of this idea appears after this list).
We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate.
arXiv Detail & Related papers (2020-01-07T05:32:43Z)
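For the delayed-OCO entry above ("Non-stationary Online Convex Optimization with Arbitrary Delays"), the following is a minimal sketch of DOGD as summarized there: one projected gradient-descent step per delayed gradient, processed in arrival order. The unit-ball projection and the constant stepsize `eta` are illustrative assumptions, not the paper's exact choices.

```python
# Minimal sketch of the DOGD idea: one projected gradient-descent step per
# delayed gradient, processed in arrival order. The unit-ball projection and
# the constant stepsize are illustrative assumptions.
import numpy as np

def project(x, radius=1.0):
    n = np.linalg.norm(x)
    return x if n <= radius else x * radius / n

def dogd(arrivals, x0, eta=0.1):
    """arrivals[t] is the list of delayed gradients that arrive in round t."""
    x = x0.copy()
    decisions = []
    for grads_t in arrivals:
        decisions.append(x.copy())         # play the current point
        for g in grads_t:                  # process delayed feedback in arrival order
            x = project(x - eta * g)       # one descent step per delayed gradient
    return decisions
```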
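For the tail-averaging entry ("Finite time analysis of temporal difference learning..."), here is a sketch of tail-averaged TD(0) with linear function approximation: run the standard TD(0) recursion and report the average of only the last fraction of iterates. The constant step size, the tail fraction, and the placeholder `sample_transition` are assumptions for illustration.

```python
# Sketch of tail-averaged TD(0) with linear function approximation: run the
# standard TD(0) recursion and average only the last `tail_fraction` of the
# iterates. Step size and tail fraction are illustrative choices.
import numpy as np

def tail_averaged_td0(sample_transition, phi, theta0, alpha=0.05, gamma=0.99,
                      num_steps=10_000, tail_fraction=0.5):
    """sample_transition(s) -> (reward, next_state) under the evaluated policy."""
    theta, s = theta0.astype(float).copy(), 0
    iterates = []
    for _ in range(num_steps):
        r, s_next = sample_transition(s)
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * delta * phi(s)
        iterates.append(theta.copy())
        s = s_next
    start = int((1.0 - tail_fraction) * num_steps)
    return np.mean(iterates[start:], axis=0)   # average over the tail only
```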
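For the PER-ETD entry, a rough sketch of the periodic-restart idea applied to an emphatic-TD(0)-style update: the follow-on trace F is reset every `period` steps so that it only accumulates over a finite window. PER-ETD's actual restart and update schedule is more refined; `sample`, which also returns the importance ratio `rho`, is a hypothetical placeholder.

```python
# Rough sketch of the periodic-restart idea behind PER-ETD: the follow-on
# trace F of an emphatic-TD(0)-style update is reset every `period` steps so
# that it only accumulates over a finite window. Illustrative, not the exact
# PER-ETD schedule.
import numpy as np

def per_etd_sketch(sample, phi, theta0, alpha=0.05, gamma=0.99,
                   period=100, num_steps=10_000):
    """sample(s) -> (reward, next_state, rho) with rho the importance ratio."""
    theta, s, F = theta0.astype(float).copy(), 0, 1.0
    for t in range(num_steps):
        if t % period == 0:
            F = 1.0                          # periodic restart of the follow-on trace
        r, s_next, rho = sample(s)
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * F * rho * delta * phi(s)
        F = gamma * rho * F + 1.0            # follow-on trace recursion (interest = 1)
        s = s_next
    return theta
```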
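For the VRTD entry, an SVRG-style sketch of variance-reduced TD over a fixed batch of transitions: within each epoch, every stochastic semi-gradient is corrected by its value at a fixed reference iterate plus the full-batch semi-gradient at that reference. The actual VRTD algorithm of Korda and La works with online Markovian samples and may differ in details; the batch/epoch structure here is an assumption for illustration.

```python
# SVRG-style sketch of variance-reduced TD over a fixed batch of transitions.
# Each stochastic semi-gradient is corrected by its value at a fixed reference
# iterate plus the full-batch semi-gradient at that reference. Illustrative
# only; VRTD itself operates on online Markovian samples.
import numpy as np

def semi_grad(theta, s, r, s_next, phi, gamma):
    delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
    return -delta * phi(s)

def vrtd_sketch(batch, phi, dim, alpha=0.05, gamma=0.99, epochs=20, seed=0):
    """batch is a list of (s, r, s_next) transitions; phi(s) -> feature vector."""
    rng = np.random.default_rng(seed)
    theta_ref = np.zeros(dim)
    for _ in range(epochs):
        # full-batch semi-gradient at the reference point
        g_ref = np.mean([semi_grad(theta_ref, *tr, phi, gamma) for tr in batch], axis=0)
        theta = theta_ref.copy()
        for _ in range(len(batch)):
            s, r, s_next = batch[rng.integers(len(batch))]
            g = semi_grad(theta, s, r, s_next, phi, gamma)
            g_corr = g - semi_grad(theta_ref, s, r, s_next, phi, gamma) + g_ref
            theta = theta - alpha * g_corr   # variance-reduced update
        theta_ref = theta
    return theta_ref
```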