Temporal Difference Learning as Gradient Splitting
- URL: http://arxiv.org/abs/2010.14657v1
- Date: Tue, 27 Oct 2020 22:50:39 GMT
- Title: Temporal Difference Learning as Gradient Splitting
- Authors: Rui Liu and Alex Olshevsky
- Abstract summary: We show that convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning.
We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.
- Score: 15.321579527891457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal difference learning with linear function approximation is a popular
method to obtain a low-dimensional approximation of the value function of a
policy in a Markov Decision Process. We give a new interpretation of this
method in terms of a splitting of the gradient of an appropriately chosen
function. As a consequence of this interpretation, convergence proofs for
gradient descent can be applied almost verbatim to temporal difference
learning. Beyond giving a new, fuller explanation of why temporal difference
works, our interpretation also yields improved convergence times. We consider
the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time
convergence time bounds for temporal difference learning had the multiplicative
factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount
factor. We show that a minor variation on TD learning which estimates the mean
of the value function separately has a convergence time where $1/(1-\gamma)$
only multiplies an asymptotically negligible term.
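To make the setting concrete, here is a minimal sketch of TD(0) with linear function approximation, a $1/\sqrt{T}$ step-size, and iterate averaging on a made-up toy Markov reward process. This is standard TD(0), not the paper's pseudocode; the paper's modified variant would additionally track the mean of the value function with a separate scalar estimate, which is omitted here.

```python
import numpy as np

# Minimal sketch of TD(0) with linear function approximation, a 1/sqrt(T)
# step-size, and iterate averaging. The 5-state random-walk reward process
# and the identity features are made-up toy choices, not from the paper.

def make_toy_mrp(n=5, seed=0):
    rng = np.random.default_rng(seed)
    P = np.zeros((n, n))
    for i in range(n):                      # random walk with wrap-around
        P[i, (i - 1) % n] = 0.5
        P[i, (i + 1) % n] = 0.5
    r = rng.standard_normal(n)              # fixed per-state rewards
    phi = np.eye(n)                         # tabular features, d = n
    return P, r, phi

def td0_linear(P, r, phi, gamma=0.9, T=50_000, seed=1):
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                 # running average of the iterates
    alpha = 1.0 / np.sqrt(T)                # constant 1/sqrt(T) step-size
    s = 0
    for t in range(T):
        s_next = rng.choice(n, p=P[s])
        td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta = theta + alpha * td_error * phi[s]
        theta_bar += (theta - theta_bar) / (t + 1)
        s = s_next
    return theta_bar

if __name__ == "__main__":
    P, r, phi = make_toy_mrp()
    gamma = 0.9
    v_true = np.linalg.solve(np.eye(len(r)) - gamma * P, r)  # exact values
    theta_bar = td0_linear(P, r, phi, gamma=gamma)
    print("value-function error:", np.linalg.norm(phi @ theta_bar - v_true))
```

With tabular features the TD fixed point coincides with the true value function; with genuine low-dimensional features it is the usual projected fixed point.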
Related papers
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
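For concreteness, tail iterate averaging simply averages the last fraction of the iterates instead of all of them; a generic sketch follows (the tail fraction is an arbitrary illustrative choice, not tied to the paper's mean-variance algorithm).

```python
import numpy as np

def tail_average(iterates, tail_fraction=0.5):
    """Average only the last `tail_fraction` of the iterates.

    Generic illustration of tail iterate averaging; the tail fraction here
    is an arbitrary choice, not the one used in the paper.
    """
    iterates = np.asarray(iterates)
    start = int(len(iterates) * (1.0 - tail_fraction))  # drop the early iterates
    return iterates[start:].mean(axis=0)

# Example: average the tail of a noisy sequence of parameter vectors.
rng = np.random.default_rng(0)
thetas = [np.ones(3) + rng.normal(scale=1.0 / (t + 1), size=3) for t in range(1000)]
print(tail_average(thetas))
```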
arXiv Detail & Related papers (2024-06-12T05:49:53Z)
- Statistical Efficiency of Distributional Temporal Difference Learning [24.03281329962804]
We analyze the finite-sample performance of distributional temporal difference learning, in particular categorical TD (CTD) and quantile TD (QTD).
For a $\gamma$-discounted infinite-horizon Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p}}\right)$ to achieve an $\varepsilon$-optimal estimator with high probability.
We establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.
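For reference, the standard tabular quantile-TD (QTD) update takes the following form; this is a generic illustration of QTD from the distributional RL literature, not the specific estimators or sample-complexity analysis of the paper, and the number of quantiles and step-size below are arbitrary.

```python
import numpy as np

def qtd_update(theta, s, r, s_next, gamma=0.9, alpha=0.05):
    """One tabular quantile-TD step (standard form, for illustration).

    theta: array of shape (num_states, m) holding m quantile estimates per
    state at levels tau_i = (2i - 1) / (2m). The step-size and m are
    arbitrary illustrative choices.
    """
    m = theta.shape[1]
    taus = (2 * np.arange(m) + 1) / (2 * m)
    # Distributional target samples: r + gamma * (each quantile of next state).
    targets = r + gamma * theta[s_next]                    # shape (m,)
    # For each quantile i, move theta[s, i] by the quantile-regression
    # (pinball-loss) subgradient averaged over the m target samples.
    indicator = (targets[None, :] < theta[s][:, None])     # shape (m, m)
    grad = taus[:, None] - indicator                       # tau_i - 1{target_j < theta_i}
    theta[s] += alpha * grad.mean(axis=1)
    return theta

# Toy usage on a 3-state chain with m = 8 quantiles per state.
theta = np.zeros((3, 8))
theta = qtd_update(theta, s=0, r=1.0, s_next=1)
print(theta[0])
```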
arXiv Detail & Related papers (2024-03-09T06:19:53Z)
- Distributed TD(0) with Almost No Communication [15.321579527891457]
We provide a new non-asymptotic analysis of distributed temporal difference learning with linear function approximation.
We demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0).
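A rough sketch of the "almost no communication" idea: $N$ agents run TD(0) independently on their own sample streams and average their parameters only once at the end. The interfaces below (P, r, phi as in the earlier TD(0) sketch) and the one-shot averaging schedule are illustrative assumptions; the paper's exact scheme and analysis may differ.

```python
import numpy as np

def local_td0(P, r, phi, gamma, T, seed):
    """One agent's independent TD(0) run with linear features (toy sketch)."""
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    theta = np.zeros(d)
    alpha = 1.0 / np.sqrt(T)
    s = int(rng.integers(n))
    for _ in range(T):
        s_next = rng.choice(n, p=P[s])
        delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta += alpha * delta * phi[s]
        s = s_next
    return theta

def distributed_td0(P, r, phi, gamma=0.9, N=8, T=10_000):
    # N agents run TD(0) on their own sample streams with no communication;
    # their parameters are averaged once at the end (illustrative schedule).
    thetas = [local_td0(P, r, phi, gamma, T, seed=k) for k in range(N)]
    return np.mean(thetas, axis=0)
```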
arXiv Detail & Related papers (2023-05-25T17:00:46Z)
- Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$, where $\tau_{avg}$ is the average delay.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
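A minimal simulation of SGD with delayed (stale) gradients, the basic mechanism behind asynchronous SGD; the random delay model, objective, and step-size below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def delayed_sgd(grad, x0, lr=0.05, T=2_000, max_delay=5, seed=0):
    """SGD where each update uses a gradient evaluated at a stale iterate.

    `grad(x, rng)` returns a stochastic gradient at x. The random per-step
    delay (up to max_delay) is a toy stand-in for asynchronous workers;
    none of this mirrors the paper's exact model.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    history = [x.copy()]                       # recent iterates, oldest first
    for t in range(T):
        delay = int(rng.integers(0, min(max_delay, t) + 1))
        stale_x = history[-(delay + 1)]        # iterate from `delay` steps ago
        x = x - lr * grad(stale_x, rng)        # apply the stale gradient
        history.append(x.copy())
        if len(history) > max_delay + 1:       # keep only what may still be needed
            history.pop(0)
    return x

# Toy objective: f(x) = 0.5 * ||x||^2 with additive gradient noise.
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(delayed_sgd(noisy_grad, x0=np.ones(4)))
```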
arXiv Detail & Related papers (2022-06-16T17:10:57Z)
- Scalable Frank-Wolfe on Generalized Self-concordant Functions via Simple Steps [66.88729048402082]
Generalized self-concordance is a key property present in the objective function of many learning problems.
We show improved convergence rates for various common cases, e.g., when the feasible region under consideration is uniformly convex or polyhedral.
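For context, a minimal vanilla Frank-Wolfe loop with the simple open-loop step-size $2/(t+2)$ is sketched below; the quadratic objective and simplex feasible region are illustrative assumptions, not the generalized self-concordant setting of the paper.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, T=500):
    """Vanilla Frank-Wolfe over the probability simplex.

    The linear minimization oracle over the simplex just picks the vertex
    (coordinate) with the smallest gradient entry. Step-size 2/(t+2) is the
    standard "simple" open-loop choice.
    """
    x = np.array(x0, dtype=float)
    for t in range(T):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0                  # LMO: best simplex vertex
        gamma_t = 2.0 / (t + 2.0)              # open-loop step-size
        x = (1.0 - gamma_t) * x + gamma_t * v  # convex combination stays feasible
    return x

# Toy problem: minimize 0.5 * ||x - b||^2 over the simplex (b is made up).
b = np.array([0.1, 0.7, 0.2, -0.3])
x_star = frank_wolfe_simplex(lambda x: x - b, x0=np.ones(4) / 4)
print(x_star)
```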
arXiv Detail & Related papers (2021-05-28T15:26:36Z)
- Parameter-free Gradient Temporal Difference Learning [3.553493344868414]
We develop gradient-based temporal difference algorithms for reinforcement learning.
Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $\log$ factors.
Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.
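For reference, the classical GTD2 update (which these guarantees are compared against) maintains a main weight vector and an auxiliary one, each with its own step-size; the sketch below shows the standard update with arbitrary step-sizes, which is exactly the tuning burden a parameter-free approach removes.

```python
import numpy as np

def gtd2_step(theta, w, phi_s, phi_next, r, gamma=0.9, alpha=0.01, beta=0.1):
    """One classical GTD2 update with linear features (for reference).

    theta: main weights; w: auxiliary weights. alpha and beta are the two
    step-sizes that parameter-free methods avoid having to tune; the values
    here are arbitrary.
    """
    delta = r + gamma * phi_next @ theta - phi_s @ theta    # TD error
    theta = theta + alpha * (phi_s - gamma * phi_next) * (phi_s @ w)
    w = w + beta * (delta - phi_s @ w) * phi_s
    return theta, w

# Toy usage with 4-dimensional features (made-up numbers).
rng = np.random.default_rng(0)
theta, w = np.zeros(4), np.zeros(4)
phi_s, phi_next = rng.standard_normal(4), rng.standard_normal(4)
theta, w = gtd2_step(theta, w, phi_s, phi_next, r=1.0)
print(theta, w, sep="\n")
```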
arXiv Detail & Related papers (2021-05-10T06:07:05Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate of the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z)
- Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
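For reference, the standard tabular double Q-learning update maintains two Q-tables and uses one to evaluate the greedy action of the other; the sketch below is the classical update with illustrative step-size and discount, not the paper's specific synchronous/asynchronous sampling schemes.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, gamma=0.99, alpha=0.1, rng=None):
    """One tabular double Q-learning step (standard form, for reference).

    With probability 1/2, QA is updated using QB to evaluate QA's greedy
    action at s_next, and vice versa. alpha and gamma are illustrative.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))          # action chosen by QA
        target = r + gamma * QB[s_next, a_star]      # evaluated by QB
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = int(np.argmax(QB[s_next]))          # action chosen by QB
        target = r + gamma * QA[s_next, b_star]      # evaluated by QA
        QB[s, a] += alpha * (target - QB[s, a])
    return QA, QB

# Toy usage: 4 states, 2 actions, one made-up transition.
QA, QB = np.zeros((4, 2)), np.zeros((4, 2))
QA, QB = double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2)
print(QA, QB, sep="\n")
```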
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
- On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD) in non-convex problems.
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)