Breaking the Deadly Triad with a Target Network
- URL: http://arxiv.org/abs/2101.08862v9
- Date: Thu, 22 Jun 2023 19:41:36 GMT
- Title: Breaking the Deadly Triad with a Target Network
- Authors: Shangtong Zhang, Hengshuai Yao, Shimon Whiteson
- Abstract summary: The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
We provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
- Score: 80.82586530205776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The deadly triad refers to the instability of a reinforcement learning
algorithm when it employs off-policy learning, function approximation, and
bootstrapping simultaneously. In this paper, we investigate the target network
as a tool for breaking the deadly triad, providing theoretical support for the
conventional wisdom that a target network stabilizes training. We first propose
and analyze a novel target network update rule which augments the commonly used
Polyak-averaging style update with two projections. We then apply the target
network and ridge regularization in several divergent algorithms and show their
convergence to regularized TD fixed points. Those algorithms are off-policy
with linear function approximation and bootstrapping, spanning both policy
evaluation and control, as well as both discounted and average-reward settings.
In particular, we provide the first convergent linear $Q$-learning algorithms
under nonrestrictive and changing behavior policies without bi-level
optimization.
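As a concrete illustration of the kind of update the abstract describes, here is a minimal sketch of off-policy linear $Q$-learning whose bootstrap target comes from Polyak-averaged target weights, combined with a projection on the target update and a ridge penalty on the online weights. The feature map, step sizes, projection radius, and the use of a single l2-ball projection are illustrative assumptions; the paper's rule uses two projections and its own constants.

    import numpy as np

    def project_l2_ball(w, radius):
        # Project a weight vector onto the l2 ball of the given radius.
        norm = np.linalg.norm(w)
        return w if norm <= radius else w * (radius / norm)

    def q_learning_step(w, w_target, s, a, r, s_next, phi, actions,
                        gamma=0.99, alpha=1e-2, tau=0.05, eta=1e-3, radius=100.0):
        # One off-policy linear Q-learning update with a Polyak-averaged,
        # projected target network and ridge (l2) regularization.
        x = phi(s, a)
        # Bootstrap from the target weights, not the online weights.
        q_next = max(w_target @ phi(s_next, b) for b in actions)
        td_error = r + gamma * q_next - w @ x
        # Ridge-regularized semi-gradient update of the online weights.
        w = w + alpha * (td_error * x - eta * w)
        # Polyak-averaging target update followed by a projection
        # (the paper augments this style of update with two projections).
        w_target = project_l2_ball((1 - tau) * w_target + tau * w, radius)
        return w, w_target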
Related papers
- Joint Learning of Network Topology and Opinion Dynamics Based on Bandit Algorithms [1.6912877206492036]
We study the joint learning of network topology and a mixed opinion dynamics model.
We propose a learning algorithm based on multi-armed bandit algorithms to address the problem.
arXiv Detail & Related papers (2023-06-25T21:53:13Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
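A common template for behavior-regularized policy iteration, written only to fix ideas (the KL form and notation are generic assumptions, not the paper's exact objective; the paper's contribution lies in how the reference policy is iteratively refined):

    \[
    \pi_{k+1} = \arg\max_{\pi}\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi}\!\left[Q^{\pi_k}(s,a)\right]
      \;-\; \alpha\, \mathbb{E}_{s\sim\mathcal{D}}\!\left[ D_{\mathrm{KL}}\!\left(\pi(\cdot\mid s)\,\middle\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\right) \right],
    \]

where standard behavior regularization fixes $\pi_{\mathrm{ref}}$ to the data-collecting policy, whereas a refined scheme moves $\pi_{\mathrm{ref}}$ toward the improved policy after each iteration.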
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Why Target Networks Stabilise Temporal Difference Methods [38.35578010611503]
We show that, under mild regularity conditions and a well-tuned target network update frequency, convergence can be guaranteed.
We conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update.
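For context, a standard way to see the issue in the linear setting (the notation below is our assumption, not taken from the summary): expected linear TD(0) follows

    \[
    \theta_{t+1} = \theta_t + \alpha\,(A\theta_t + b), \qquad
    A = \Phi^\top D(\gamma P_\pi - I)\Phi, \qquad
    b = \Phi^\top D r_\pi,
    \]

and off-policy the matrix $A$ need not be negative definite, so the iteration can diverge. Freezing a target $\bar\theta$ in the bootstrap turns the update into

    \[
    \theta_{t+1} = \theta_t + \alpha\,\Phi^\top D\bigl(r_\pi + \gamma P_\pi\Phi\bar\theta - \Phi\theta_t\bigr),
    \]

whose Jacobian with respect to $\theta$ is $-\alpha\,\Phi^\top D\Phi \preceq 0$, the better-conditioned object this summary alludes to.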
arXiv Detail & Related papers (2023-02-24T09:46:00Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
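To make the double-sampling remark concrete, one generic identity behind this kind of construction (not the paper's exact objective) rewrites the variance through its Fenchel-dual, least-squares form:

    \[
    \operatorname{Var}(X) \;=\; \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2 \;=\; \min_{\nu}\ \mathbb{E}\bigl[(X-\nu)^2\bigr],
    \]

so the squared expectation, whose gradient would require two independent samples, is replaced by an inner problem that is linear in the sampling distribution for fixed $\nu$ and can be estimated from single samples.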
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Bridging the Gap Between Target Networks and Functional Regularization [61.051716530459586]
We propose an explicit Functional Regularization that is a convex regularizer in function space and can easily be tuned.
We analyze the convergence of our method theoretically and empirically demonstrate that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.
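As a rough illustration of the contrast with a target network, here is a numpy sketch under a linear value function; the squared penalty toward a slowly refreshed snapshot, the coefficient kappa, and the refresh schedule are our assumptions rather than the paper's exact formulation.

    import numpy as np

    def fr_td_step(w, w_snapshot, s, a, r, s_next, a_next, phi,
                   gamma=0.99, alpha=1e-2, kappa=1.0):
        # TD update that bootstraps from the online weights (no target network
        # in the bootstrap) and adds an explicit convex functional regularizer
        # pulling current predictions toward a slowly refreshed snapshot.
        x, x_next = phi(s, a), phi(s_next, a_next)
        td_error = r + gamma * (w @ x_next) - w @ x
        # Gradient of 0.5 * kappa * (Q_w(s,a) - Q_snapshot(s,a))^2 w.r.t. w.
        fr_grad = kappa * ((w @ x) - (w_snapshot @ x)) * x
        return w + alpha * (td_error * x - fr_grad)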
arXiv Detail & Related papers (2022-10-21T22:27:07Z)
- Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z)
- Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach [65.27783264330711]
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity.
We devise algorithms learning optimal tilt control policies from existing data.
We show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
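For readers less familiar with the setting, a generic LinUCB-style contextual linear bandit looks as follows; this is an illustrative sketch only, and the paper's tilt features, reward definition, and algorithm are not reproduced here.

    import numpy as np

    class LinUCB:
        # Generic contextual linear bandit with an optimism bonus.
        def __init__(self, dim, alpha=1.0, lam=1.0):
            self.A = lam * np.eye(dim)   # regularized design matrix
            self.b = np.zeros(dim)       # reward-weighted feature sum
            self.alpha = alpha

        def choose(self, contexts):
            # contexts: (num_actions, dim), one feature vector per candidate
            # tilt update; pick the action with the largest upper bound.
            A_inv = np.linalg.inv(self.A)
            theta = A_inv @ self.b
            bonus = np.sqrt(np.einsum('ad,dk,ak->a', contexts, A_inv, contexts))
            return int(np.argmax(contexts @ theta + self.alpha * bonus))

        def update(self, x, reward):
            # Rank-one update of the design matrix and reward vector.
            self.A += np.outer(x, x)
            self.b += reward * x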
arXiv Detail & Related papers (2022-01-06T18:24:30Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
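For reference, the evaluation target in this setting is the differential (bias) value function; in standard notation (our choice, not spelled out in the summary):

    \[
    \bar r_\pi = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_\pi[R_t], \qquad
    q_\pi(s,a) = \mathbb{E}_\pi\bigl[R_{t+1} - \bar r_\pi + q_\pi(S_{t+1},A_{t+1}) \mid S_t=s,\ A_t=a\bigr],
    \]

so there is no discount factor, and bootstrapping enters through $q_\pi(S_{t+1},A_{t+1})$ on the right-hand side, which is why the deadly triad arises here as well.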
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- GRAC: Self-Guided and Self-Regularized Actor-Critic [24.268453994605512]
We propose a self-regularized TD-learning method to address divergence without requiring a target network.
We also propose a self-guided policy improvement method by combining policy-gradient with zero-order optimization.
This makes learning more robust to local noise in the Q function approximation and guides the updates of our actor network.
We evaluate GRAC on a suite of OpenAI Gym tasks, matching or outperforming the state of the art in every environment tested.
arXiv Detail & Related papers (2020-09-18T17:58:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.