Gradient Descent Temporal Difference-difference Learning
- URL: http://arxiv.org/abs/2209.04624v1
- Date: Sat, 10 Sep 2022 08:55:20 GMT
- Title: Gradient Descent Temporal Difference-difference Learning
- Authors: Rong J.B. Zhu and James M. Murray
- Abstract summary: We propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD2, a GTD algorithm.
We study the model empirically on the random walk task, the Boyan-chain task, and Baird's off-policy counterexample.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy algorithms, in which a behavior policy differs from the target
policy and is used to gain experience for learning, have proven to be of great
practical value in reinforcement learning. However, even for simple convex
problems such as linear value function approximation, these algorithms are not
guaranteed to be stable. To address this, alternative algorithms that are
provably convergent in such cases have been introduced, the most well known
being gradient descent temporal difference (GTD) learning. This algorithm and
others like it, however, tend to converge much more slowly than conventional
temporal difference learning. In this paper we propose gradient descent
temporal difference-difference (Gradient-DD) learning in order to improve GTD2,
a GTD algorithm, by introducing second-order differences in successive
parameter updates. We investigate this algorithm in the framework of linear
value function approximation, theoretically proving its convergence by applying
the theory of stochastic approximation. Studying the model empirically on the
random walk task, the Boyan-chain task, and Baird's off-policy counterexample, we find
substantial improvement over GTD2 and, in several cases, better performance
even than conventional TD learning.
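For context, the standard GTD2 update with linear features can be sketched as follows. This is only a minimal illustration of the baseline that Gradient-DD modifies; the exact Gradient-DD update, which introduces second-order differences in successive parameter updates, is derived in the paper and is not reproduced here. Variable names, step sizes, and the importance-sampling ratio `rho` are illustrative assumptions.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta, rho=1.0):
    """One GTD2 update with linear value-function approximation.

    theta    : primary weights of the value estimate V(s) ~= phi(s) @ theta
    w        : auxiliary weights used by GTD2 to correct the gradient
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio (1.0 when on-policy); illustrative
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta   # TD error
    theta = theta + alpha * rho * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w
```

Gradient-DD, as described in the abstract, augments this scheme with a term penalizing the change between successive parameter iterates, so updates are damped when consecutive value estimates change abruptly; the precise regularized update follows from the paper's derivation.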
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Backstepping Temporal Difference Learning [3.5823366350053325]
We propose a new convergent algorithm for off-policy TD-learning.
Our method relies on the backstepping technique, which is widely used in nonlinear control theory.
Convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handling streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-06-21T12:11:39Z)
- Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates (a simplified form of this weighting is sketched after this list).
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
arXiv Detail & Related papers (2021-05-10T06:07:05Z)
- Parameter-free Gradient Temporal Difference Learning [3.553493344868414]
We develop gradient-based temporal difference algorithms for reinforcement learning.
Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $\log$ factors.
Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale, single-loop algorithms that require only one data point per step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear optimization problem.
We solve for one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-06T21:04:21Z)
- Proximal Gradient Temporal Difference Learning: Stable Reinforcement
Learning with Polynomial Sample Complexity [40.73281056650241]
We introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true gradient temporal difference learning algorithms.
We show how gradient TD reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function (illustrated in the sketch after this list).
arXiv Detail & Related papers (2020-06-06T20:16:25Z)
- Finite-Sample Analysis of Proximal Gradient TD Algorithms [43.035055641190105]
We analyze the convergence rate of the gradient temporal difference learning (GTD) family of algorithms.
Two revised algorithms are also proposed, namely projected GTD2 and GTD2-MP.
The results of our theoretical analysis show that the GTD family of algorithms is indeed comparable to the existing LSTD methods in off-policy learning scenarios.
arXiv Detail & Related papers (2020-06-06T20:16:25Z)
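As referenced in the Emphatic Algorithms entry above, the ETD($\lambda$) weighting can be summarized in a simplified form, assuming a constant discount $\gamma$, a constant bootstrapping parameter $\lambda$, unit interest, and linear features $x_t$ (a sketch, not the full algorithm):

$\delta_t = R_{t+1} + \gamma\,\theta_t^\top x_{t+1} - \theta_t^\top x_t$ (TD error)
$F_t = \rho_{t-1}\,\gamma\,F_{t-1} + 1$ (followon trace)
$M_t = \lambda + (1-\lambda)\,F_t$ (emphasis)
$e_t = \rho_t\,(\gamma\lambda\,e_{t-1} + M_t\,x_t)$ (emphatic eligibility trace)
$\theta_{t+1} = \theta_t + \alpha\,\delta_t\,e_t$

Here $\rho_t$ is the importance-sampling ratio between target and behavior policies; the emphasis $M_t$ is what reweights the TD($\lambda$) updates to restore convergence under off-policy sampling in the linear case.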
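The primal-dual saddle-point view referenced in the Proximal Gradient Temporal Difference Learning entry can be illustrated for linear features. Writing the usual expectations as $A = E[x_t (x_t - \gamma x_{t+1})^\top]$, $b = E[r_{t+1} x_t]$, and $C = E[x_t x_t^\top]$ (standard notation, assumed here), the mean-squared projected Bellman error is

$\mathrm{MSPBE}(\theta) = \|b - A\theta\|^2_{C^{-1}}$,

and by convex conjugacy

$\tfrac{1}{2}\,\mathrm{MSPBE}(\theta) = \max_w \; \langle b - A\theta,\, w\rangle - \tfrac{1}{2} w^\top C w$,

so minimizing the MSPBE is equivalent to the saddle-point problem $\min_\theta \max_w L(\theta, w)$ with $L(\theta, w) = \langle b - A\theta, w\rangle - \tfrac{1}{2}\|w\|^2_C$. Stochastic gradient descent in $\theta$ and ascent in $w$ on this objective essentially recovers the GTD2 updates, which is the derivation route the entry refers to.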
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.