Why Target Networks Stabilise Temporal Difference Methods
- URL: http://arxiv.org/abs/2302.12537v3
- Date: Fri, 11 Aug 2023 20:28:56 GMT
- Title: Why Target Networks Stabilise Temporal Difference Methods
- Authors: Mattie Fellows, Matthew J. A. Smith, Shimon Whiteson
- Abstract summary: We show that, under mild regularity conditions and a well-tuned target network update frequency, convergence can be guaranteed.
We conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integral to recent successes in deep reinforcement learning has been a class of temporal difference methods that use infrequently updated target values for policy evaluation in a Markov Decision Process. Yet a complete theoretical explanation for the effectiveness of target networks remains elusive. In this work, we provide an analysis of this popular class of algorithms to finally answer the question: "why do target networks stabilise TD learning?" To do so, we formalise the notion of a partially fitted policy evaluation method, which describes the use of target networks and bridges the gap between fitted methods and semigradient temporal difference algorithms. Using this framework, we are able to uniquely characterise the so-called deadly triad, i.e. the use of TD updates with (nonlinear) function approximation and off-policy data, which often leads to nonconvergent algorithms. This insight leads us to conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update. We then show that, under mild regularity conditions and a well-tuned target network update frequency, convergence can be guaranteed even in the extremely challenging off-policy sampling and nonlinear function approximation setting.
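To make this class of methods concrete, the following is a minimal sketch of semi-gradient TD(0) policy evaluation with an infrequently updated, hard-copied target network. The toy random-walk MRP, the one-hot features, the step size, and the update period K are illustrative assumptions, not the paper's setting.

```python
import numpy as np

# Minimal sketch: semi-gradient TD(0) policy evaluation with a target network.
# The 5-state random walk, one-hot features, step size and update period K are
# illustrative assumptions only.

rng = np.random.default_rng(0)
n_states, gamma, alpha, K = 5, 0.99, 0.05, 100

features = np.eye(n_states)        # one-hot features (tabular special case)
w = np.zeros(n_states)             # online weights
w_target = w.copy()                # target-network weights, copied every K steps

s = n_states // 2
for t in range(20_000):
    s_next = int(np.clip(s + rng.choice([-1, 1]), 0, n_states - 1))
    r = 1.0 if s_next == n_states - 1 else 0.0

    # Bootstrapped value comes from the *target* weights, not the online weights.
    td_error = r + gamma * features[s_next] @ w_target - features[s] @ w
    w += alpha * td_error * features[s]

    if (t + 1) % K == 0:           # infrequent hard target update
        w_target = w.copy()
    s = s_next

print("estimated state values:", features @ w)
```

Setting K = 1 recovers ordinary semi-gradient TD(0); larger K corresponds to the infrequently updated targets whose effect the paper analyses.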
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for such statistical inference.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
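For intuition about the entry above, here is a minimal sketch of linear TD(0) with Polyak-Ruppert iterate averaging, where the averaged iterate rather than the last iterate is the object of statistical interest. The synthetic transition generator, dimensions, and decaying step size are assumptions for illustration only.

```python
import numpy as np

# Sketch: linear TD(0) with Polyak-Ruppert (iterate) averaging.
# The synthetic generative model and step-size schedule are illustrative only.

rng = np.random.default_rng(1)
d, gamma, alpha = 4, 0.9, 0.1

def sample_transition():
    """Hypothetical generator returning (phi(s), reward, phi(s'))."""
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    return phi, phi.sum() + rng.normal(), phi_next

w = np.zeros(d)          # TD iterate
w_bar = np.zeros(d)      # running Polyak-Ruppert average of the iterates

for t in range(1, 50_001):
    phi, r, phi_next = sample_transition()
    td_error = r + gamma * phi_next @ w - phi @ w
    w += alpha / np.sqrt(t) * td_error * phi    # decaying step size
    w_bar += (w - w_bar) / t                    # incremental average

print("last iterate:    ", w)
print("averaged iterate:", w_bar)
```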
- Bridging the Gap Between Target Networks and Functional Regularization [61.051716530459586]
We propose an explicit Functional Regularization that is a convex regularizer in function space and can easily be tuned.
We analyze the convergence of our method theoretically and empirically demonstrate that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.
arXiv Detail & Related papers (2022-10-21T22:27:07Z)
- Bridging the Gap Between Target Networks and Functional Regularization [61.051716530459586]
We show that Target Networks act as an implicit regularizer which can be beneficial in some cases, but also have disadvantages.
We propose an explicit Functional Regularization alternative that is flexible and a convex regularizer in function space.
Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and result in performance improvement.
arXiv Detail & Related papers (2021-06-04T17:21:07Z)
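The two entries above replace the target network with an explicit penalty in function space. Below is a minimal sketch of that idea, assuming a linear value function, a squared-distance penalty toward a lagging snapshot of the function, and an arbitrary penalty weight kappa; it is not the authors' exact formulation.

```python
import numpy as np

# Sketch of functional regularization as an alternative to a target network:
# bootstrap from the online function, but penalise movement away from a lagging
# snapshot f_prev. Linear value function, kappa and the toy data are assumptions.

rng = np.random.default_rng(2)
d, gamma, alpha, kappa = 4, 0.9, 0.05, 1.0

w = np.zeros(d)          # online value-function weights
w_prev = w.copy()        # lagging snapshot defining the regularisation centre

for t in range(10_000):
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    r = phi.sum()

    # Semi-gradient step on 0.5*td_error^2 + 0.5*kappa*(f(s) - f_prev(s))^2
    td_error = r + gamma * phi_next @ w - phi @ w
    reg_error = phi @ w - phi @ w_prev
    w += alpha * (td_error - kappa * reg_error) * phi

    if (t + 1) % 500 == 0:      # occasionally refresh the regularisation centre
        w_prev = w.copy()

print("weights:", w)
```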
- Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection [7.685002911021767]
We introduce an algorithm that efficiently learns policies in non-stationary environments.
It analyzes a possibly infinite stream of data and computes, in real-time, high-confidence change-point detection statistics.
We show that this algorithm minimizes the delay until unforeseen changes to a context are detected, thereby allowing for rapid responses.
arXiv Detail & Related papers (2021-05-20T01:57:52Z)
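For rough intuition about online change detection over a reward stream, the sketch below uses a generic CUSUM-style statistic; it is not the paper's high-confidence statistic, and the drift and threshold values are arbitrary assumptions.

```python
import numpy as np

# Generic CUSUM-style change-point detector over a scalar reward stream.
# Illustration only: the paper's high-confidence statistic differs.

def cusum_detect(stream, drift=0.05, threshold=5.0):
    """Return the index at which an upward mean shift is flagged, or None."""
    g, mean, n = 0.0, 0.0, 0
    for t, x in enumerate(stream):
        n += 1
        mean += (x - mean) / n                 # running pre-change mean estimate
        g = max(0.0, g + (x - mean - drift))   # accumulate positive deviations
        if g > threshold:
            return t
    return None

rng = np.random.default_rng(3)
rewards = np.concatenate([rng.normal(0.0, 1.0, 500),    # stationary regime
                          rng.normal(1.5, 1.0, 500)])   # regime change at t=500
print("change flagged at step:", cusum_detect(rewards))
```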
- Breaking the Deadly Triad with a Target Network [80.82586530205776]
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
We provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
arXiv Detail & Related papers (2021-01-21T21:50:10Z)
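A minimal sketch of the mechanism studied in the entry above: Q-learning with linear (here one-hot, i.e. tabular) features and a periodically copied target weight vector, trained on data from an off-policy uniform behaviour policy. The two-state MDP, step size, and update period K are illustrative assumptions.

```python
import numpy as np

# Sketch: Q-learning with a target network under a uniform (off-policy)
# behaviour policy. The 2-state, 2-action MDP and hyperparameters are
# illustrative assumptions; tabular Q is the one-hot special case of linear.

rng = np.random.default_rng(4)
n_s, n_a, gamma, alpha, K = 2, 2, 0.9, 0.1, 200

P = np.array([[[0.9, 0.1], [0.1, 0.9]],     # P[s, a, s']
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([[0.0, 1.0], [1.0, 0.0]])      # R[s, a]

theta = np.zeros((n_s, n_a))                # online Q estimates
theta_target = theta.copy()                 # target copy, refreshed every K steps

s = 0
for t in range(50_000):
    a = int(rng.integers(n_a))              # off-policy: uniform random behaviour
    s_next = int(rng.choice(n_s, p=P[s, a]))
    target = R[s, a] + gamma * theta_target[s_next].max()
    theta[s, a] += alpha * (target - theta[s, a])
    if (t + 1) % K == 0:
        theta_target = theta.copy()         # infrequent hard target update
    s = s_next

print("Q estimates:\n", theta)
```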
- Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the strong generalization of overparameterized models observed in supervised learning also occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)
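To make the value-based versus policy-based contrast in the entry above concrete, the sketch below evaluates the two standard offline objectives on synthetic logged bandit data: a regression loss on observed rewards and an importance-weighted (IPS) estimate of policy value. The data-generating process, uniform logging policy, and linear models are assumptions for illustration only, not the paper's setup.

```python
import numpy as np

# Sketch: the two offline contextual-bandit objectives contrasted in the entry
# above, evaluated on synthetic logged data. All data and models are illustrative.

rng = np.random.default_rng(5)
n, d, n_a = 1000, 5, 3

X = rng.normal(size=(n, d))                   # contexts
logged_a = rng.integers(n_a, size=n)          # actions from a uniform logging policy
prop = np.full(n, 1.0 / n_a)                  # logging propensities
true_W = rng.normal(size=(d, n_a))
rewards = (X @ true_W)[np.arange(n), logged_a] + 0.1 * rng.normal(size=n)

W = rng.normal(size=(d, n_a))                 # a candidate reward model / policy scorer
q_hat = X @ W                                 # predicted reward per (context, action)

# Value-based objective: squared error on the logged (context, action) pairs.
value_loss = np.mean((q_hat[np.arange(n), logged_a] - rewards) ** 2)

# Policy-based objective: IPS estimate of the value of the greedy policy w.r.t. W.
pi_a = q_hat.argmax(axis=1)
ips_value = np.mean((pi_a == logged_a) / prop * rewards)

print(f"value-based loss: {value_loss:.3f}   IPS policy value: {ips_value:.3f}")
```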