Bridging the Gap Between Target Networks and Functional Regularization
- URL: http://arxiv.org/abs/2210.12282v2
- Date: Wed, 3 Jan 2024 17:02:21 GMT
- Title: Bridging the Gap Between Target Networks and Functional Regularization
- Authors: Alexandre Piche and Valentin Thomas and Joseph Marino and Rafael
Pardinas and Gian Maria Marconi and Christopher Pal and Mohammad Emtiyaz Khan
- Abstract summary: We propose an explicit Functional Regularization that is a convex regularizer in function space and can easily be tuned.
We analyze the convergence of our method theoretically and empirically demonstrate that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.
- Score: 61.051716530459586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bootstrapping is behind much of the success of Deep Reinforcement Learning.
However, learning the value function via bootstrapping often leads to unstable
training due to fast-changing target values. Target Networks are employed to
stabilize training by using an additional set of lagging parameters to estimate
the target values. Despite the popularity of Target Networks, their effect on
optimization is still poorly understood. In this work, we show that they act as
an implicit regularizer. This regularizer has disadvantages, such as being
inflexible and non-convex. To overcome these issues, we propose an explicit
Functional Regularization that is a convex regularizer in function space and
can easily be tuned. We analyze the convergence of our method theoretically and
empirically demonstrate that replacing Target Networks with the more
theoretically grounded Functional Regularization approach leads to better
sample efficiency and performance improvements.
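To make the contrast in the abstract concrete, below is a minimal sketch of the two training losses: bootstrapping through a lagging Target Network versus bootstrapping through the online network plus an explicit penalty in function space. This is an illustration under PyTorch-style assumptions, not the paper's exact algorithm; the names (`q_net`, `snapshot_net`), the squared-error penalty, and the weight `lam` are our own.

```python
# Illustrative sketch only; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def td_loss_with_target_network(q_net, target_net, batch, gamma=0.99):
    """Standard TD loss: the bootstrap target comes from lagging parameters."""
    s, a, r, s_next, done = batch          # a: LongTensor of shape [B, 1]
    q = q_net(s).gather(1, a)
    with torch.no_grad():
        next_q = target_net(s_next).max(dim=1, keepdim=True).values
        target = r + gamma * (1.0 - done) * next_q
    return F.mse_loss(q, target)

def td_loss_with_functional_reg(q_net, snapshot_net, batch, gamma=0.99, lam=1.0):
    """TD loss with bootstrapping from the online network, plus an explicit,
    tunable penalty keeping current predictions close (in function space)
    to those of an earlier snapshot of the network."""
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a)
    with torch.no_grad():
        next_q = q_net(s_next).max(dim=1, keepdim=True).values
        target = r + gamma * (1.0 - done) * next_q
        q_snapshot = snapshot_net(s).gather(1, a)
    return F.mse_loss(q, target) + lam * F.mse_loss(q, q_snapshot)
```

In the first function, stability is controlled implicitly by how often `target_net` is synced; in the second, it is controlled explicitly by the weight `lam`, which is the flexibility and tunability the abstract refers to.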
Related papers
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Why Target Networks Stabilise Temporal Difference Methods [38.35578010611503]
We show that under mild regularity conditions and a well-tuned target-network update frequency, convergence can be guaranteed.
We conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update.
arXiv Detail & Related papers (2023-02-24T09:46:00Z)
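For readers unfamiliar with the Jacobian argument mentioned in the entry above, the following is a brief, hedged illustration in the standard linear TD(0) setting; the notation is ours and is not taken from that paper.

```latex
% Linear TD(0): features \Phi, state distribution D, transitions P, discount \gamma.
\[
  \theta \leftarrow \theta + \alpha\,\Phi^{\top} D \bigl(r + \gamma P \Phi \theta - \Phi \theta\bigr)
  \;=\; \theta + \alpha\,(b - A\theta),
  \qquad A = \Phi^{\top} D\,(I - \gamma P)\,\Phi .
\]
% Without a target network, the update is driven by A, which is generally
% non-symmetric and can be poorly conditioned. Freezing the bootstrap term at
% lagging parameters \bar{\theta} replaces A in the Jacobian of the update by
% the symmetric positive semi-definite matrix \Phi^{\top} D\,\Phi.
```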
- KL Guided Domain Adaptation [88.19298405363452]
Domain adaptation is an important problem and often needed for real-world applications.
A common approach in the domain adaptation literature is to learn a representation of the input that has the same distribution over the source and target domains.
We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples.
arXiv Detail & Related papers (2021-06-14T22:24:23Z)
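As a rough illustration of the "minibatch samples" remark in the entry above, here is one way a KL term between source and target representation distributions could be estimated with a probabilistic (Gaussian) encoder. The mixture-over-minibatch approximation and all names are our assumptions, not necessarily that paper's estimator.

```python
# Illustrative sketch only; assumes `encoder(x)` returns (mean, log_std) of a
# diagonal Gaussian representation for each input in the minibatch.
import math
import torch

def diag_gaussian_log_prob(z, mean, log_std):
    """log N(z; mean, diag(exp(log_std))^2), summed over the feature dimension."""
    var = torch.exp(2.0 * log_std)
    return (-0.5 * ((z - mean) ** 2 / var + 2.0 * log_std
                    + math.log(2.0 * math.pi))).sum(-1)

def minibatch_kl_estimate(encoder, x_source, x_target):
    """Monte-Carlo estimate of KL(p_source(z) || p_target(z)), approximating each
    marginal by the mixture of per-example Gaussians in the current minibatch."""
    mu_s, log_std_s = encoder(x_source)          # [B, d] each
    mu_t, log_std_t = encoder(x_target)
    z = mu_s + torch.randn_like(mu_s) * torch.exp(log_std_s)   # z ~ q(z | x_source)
    log_p_s = torch.logsumexp(
        diag_gaussian_log_prob(z.unsqueeze(1), mu_s.unsqueeze(0), log_std_s.unsqueeze(0)),
        dim=1,
    ) - math.log(mu_s.shape[0])
    log_p_t = torch.logsumexp(
        diag_gaussian_log_prob(z.unsqueeze(1), mu_t.unsqueeze(0), log_std_t.unsqueeze(0)),
        dim=1,
    ) - math.log(mu_t.shape[0])
    return (log_p_s - log_p_t).mean()
```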
- Bridging the Gap Between Target Networks and Functional Regularization [61.051716530459586]
We show that Target Networks act as an implicit regularizer which can be beneficial in some cases, but also have disadvantages.
We propose an explicit Functional Regularization alternative that is flexible and a convex regularizer in function space.
Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and result in performance improvement.
arXiv Detail & Related papers (2021-06-04T17:21:07Z)
- Breaking the Deadly Triad with a Target Network [80.82586530205776]
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
We provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
arXiv Detail & Related papers (2021-01-21T21:50:10Z)
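To spell out the three ingredients named in the entry above, here is a bare-bones linear Q-learning step with a periodically synced target weight vector. It only illustrates the triad (off-policy greedy target, linear function approximation, bootstrapping); the convergent algorithms in that paper include further components not shown here, and all names are our own.

```python
# Illustrative sketch of the "deadly triad" ingredients; not that paper's algorithm.
import numpy as np

def q_learning_step(w, w_target, feat_sa, reward, next_feats, gamma=0.99, alpha=0.1):
    """One update of linear Q-learning with a lagging target weight vector.

    feat_sa:    feature vector of the visited (state, action) pair
    next_feats: list of feature vectors, one per action, at the next state
    """
    q_sa = feat_sa @ w                                        # linear function approximation
    max_next = np.max([f @ w_target for f in next_feats])     # bootstrapping via lagging weights
    td_error = reward + gamma * max_next - q_sa               # greedy (off-policy) target
    return w + alpha * td_error * feat_sa

def sync_target(w):
    """Periodically copy the online weights into the target weights."""
    return w.copy()
```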
- Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the same phenomenon occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.