Backstepping Temporal Difference Learning
- URL: http://arxiv.org/abs/2302.09875v1
- Date: Mon, 20 Feb 2023 10:06:49 GMT
- Title: Backstepping Temporal Difference Learning
- Authors: Han-Dong Lim and Donghwan Lee
- Abstract summary: We propose a new convergent algorithm for off-policy TD-learning.
Our method relies on the backstepping technique, which is widely used in nonlinear control theory.
Convergence of the proposed algorithm is experimentally verified in environments where the standard TD-learning is known to be unstable.
- Score: 3.5823366350053325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy learning ability is an important feature of reinforcement learning
(RL) for practical applications. However, even one of the most elementary RL
algorithms, temporal-difference (TD) learning, is known to suffer from a
divergence issue when the off-policy scheme is used together with linear
function approximation. To overcome the divergent behavior, several off-policy
TD-learning algorithms, including gradient-TD learning (GTD) and TD-learning
with correction (TDC), have been developed to date. In this work, we provide
a unified view of such algorithms from a purely control-theoretic perspective,
and propose a new convergent algorithm. Our method relies on the backstepping
technique, which is widely used in nonlinear control theory.
Finally, convergence of the proposed algorithm is experimentally verified in
environments where the standard TD-learning is known to be unstable.
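To make the divergence issue concrete, the sketch below reproduces the classic two-state off-policy counterexample with linear function approximation, in which the two value estimates are theta and 2*theta, and contrasts semi-gradient TD(0), which diverges, with TD-learning with correction (TDC), which converges. This is an illustrative sketch only: the features, reward, step sizes, and importance ratio are made-up choices, and it does not implement the backstepping algorithm proposed in the paper.

```python
# Minimal sketch (not the paper's backstepping method): the classic two-state
# off-policy counterexample with linear value estimates theta and 2*theta,
# comparing semi-gradient TD(0), which diverges, against TD-learning with
# correction (TDC), which converges.  All constants are illustrative.

GAMMA = 0.99   # discount factor
ALPHA = 0.01   # main step size
BETA = 0.05    # step size for TDC's auxiliary weight (faster timescale)

# One off-policy transition, visited repeatedly under the behavior
# distribution: state s1 (feature phi = 1) -> state s2 (feature phi' = 2),
# reward 0, importance-sampling ratio rho = 1.
phi, phi_next, reward, rho = 1.0, 2.0, 0.0, 1.0

theta_td, theta_tdc, w_tdc = 1.0, 1.0, 0.0

for step in range(500):
    # Semi-gradient off-policy TD(0): theta grows without bound because the
    # key quantity A = phi * (phi - GAMMA * phi_next) is negative here.
    delta = reward + GAMMA * phi_next * theta_td - phi * theta_td
    theta_td += ALPHA * rho * delta * phi

    # TDC: the auxiliary weight w tracks E[delta * phi] / E[phi^2] and
    # corrects the update so theta converges to the MSPBE solution (0 here).
    delta = reward + GAMMA * phi_next * theta_tdc - phi * theta_tdc
    theta_tdc += ALPHA * rho * (delta * phi - GAMMA * phi_next * (phi * w_tdc))
    w_tdc += BETA * rho * (delta - phi * w_tdc) * phi

    if step % 100 == 0:
        print(f"step {step:3d}  TD theta = {theta_td:9.3f}  TDC theta = {theta_tdc:7.4f}")
```

Running this prints the TD(0) estimate growing geometrically while the TDC estimate decays toward zero, the fixed point of the projected Bellman equation for this example.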
Related papers
- PID Accelerated Temporal Difference Algorithms [7.634360142922117]
Algorithms such as Value Iteration and Temporal Difference (TD) learning can have a slow convergence rate and become inefficient in many tasks.
PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory.
We give a theoretical analysis of the convergence of PID TD Learning and its acceleration compared to the conventional TD Learning.
arXiv Detail & Related papers (2024-07-11T18:23:46Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Gradient Descent Temporal Difference-difference Learning [0.0]
We propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD2, a GTD algorithm.
We study the model empirically on the random walk task, the Boyan-chain task, and Baird's off-policy counterexample.
arXiv Detail & Related papers (2022-09-10T08:55:20Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Online Attentive Kernel-Based Temporal Difference Learning [13.94346725929798]
Online Reinforcement Learning (RL) has been receiving increasing attention due to its fast learning capability and improving data efficiency.
Online RL often suffers from complex Value Function Approximation (VFA) and catastrophic interference.
We propose an Online Attentive Kernel-Based Temporal Difference (OAKTD) algorithm using two-timescale optimization.
arXiv Detail & Related papers (2022-01-22T14:47:10Z) - Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates (a minimal update sketch appears after this list).
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
arXiv Detail & Related papers (2021-06-21T12:11:39Z) - Predictor-Corrector(PC) Temporal Difference(TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call the Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The parameter being approximated has a guaranteed order of magnitude reduction in the Taylor Series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z) - Discovering Reinforcement Learning Algorithms [53.72358280495428]
Reinforcement learning algorithms update an agent's parameters according to one of several possible rules.
This paper introduces a new meta-learning approach that discovers an entire update rule.
It includes both 'what to predict' (e.g. value functions) and 'how to learn from it' by interacting with a set of environments.
arXiv Detail & Related papers (2020-07-17T07:38:39Z)
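For the emphatic TD entry above, the sketch below gives a minimal, illustrative implementation of the linear ETD($\lambda$) update with its follow-on trace and emphasis weighting. The function name, step sizes, constant interest of 1, and the toy features in the usage example are assumptions for illustration, not code from any of the listed papers.

```python
import numpy as np

def etd_lambda_update(theta, e, F, phi, phi_next, reward, rho, rho_prev,
                      gamma=0.99, lam=0.9, alpha=0.01, interest=1.0):
    """One linear ETD(lambda) step; returns the updated (theta, e, F)."""
    # Follow-on trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: blend of the instantaneous interest and the follow-on trace.
    M = lam * interest + (1.0 - lam) * F
    # Emphatic eligibility trace.
    e = rho * (gamma * lam * e + M * phi)
    # Standard TD error with linear value estimates, applied along the trace.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * delta * e
    return theta, e, F

# Toy usage with made-up 3-dimensional features and importance ratios.
theta, e, F = np.zeros(3), np.zeros(3), 0.0
phi, phi_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
theta, e, F = etd_lambda_update(theta, e, F, phi, phi_next,
                                reward=1.0, rho=1.2, rho_prev=1.0)
```

The emphasis M rescales each update so that, in expectation, the off-policy updates follow a positive-definite key matrix, which is what restores convergence in the linear setting.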
This list is automatically generated from the titles and abstracts of the papers on this site.