Debiasing Meta-Gradient Reinforcement Learning by Learning the Outer Value Function
- URL: http://arxiv.org/abs/2211.10550v1
- Date: Sat, 19 Nov 2022 00:59:20 GMT
- Title: Debiasing Meta-Gradient Reinforcement Learning by Learning the Outer Value Function
- Authors: Clément Bonnet, Laurence Midgley, Alexandre Laterre
- Abstract summary: We identify a bias in the meta-gradient of current meta-gradient RL approaches.
This bias comes from using a critic that is trained with the meta-learned discount factor to estimate advantages in the outer objective.
Because the meta-learned discount factor is typically lower than the one used in the outer objective, the resulting bias can cause the meta-gradient to favor myopic policies.
- Score: 69.59204851882643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Meta-gradient Reinforcement Learning (RL) allows agents to self-tune their
hyper-parameters in an online fashion during training. In this paper, we
identify a bias in the meta-gradient of current meta-gradient RL approaches.
This bias comes from using the critic that is trained using the meta-learned
discount factor for the advantage estimation in the outer objective, which
requires a different discount factor. Because the meta-learned discount factor
is typically lower than the one used in the outer objective, the resulting bias
can cause the meta-gradient to favor myopic policies. We propose a simple
solution to this issue: we eliminate this bias by using an alternative,
outer value function in the estimation of the outer loss. To obtain this
outer value function we add a second head to the critic network and train it
alongside the classic critic, using the outer loss discount factor. On an
illustrative toy problem, we show that the bias can cause catastrophic failure
of current meta-gradient RL approaches, and show that our proposed solution
fixes it. We then apply our method to a more complex environment and
demonstrate that fixing the meta-gradient bias can significantly improve
performance.
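A rough sketch of the proposed fix is given below. This is not the authors' implementation: the PyTorch module, the helper names (TwoHeadCritic, critic_loss, outer_advantage), and the use of simple one-step TD targets are assumptions made for illustration. It mirrors the idea described above: the critic gets a second head trained with the outer discount factor, and only that outer head is used to estimate advantages for the outer (meta) objective, so the bootstrap value is consistent with the outer discount rather than the meta-learned one.

```python
# Hypothetical sketch (not the paper's code): a critic with two value heads.
# The inner head is trained with the meta-learned discount gamma_inner; the
# outer head is trained with the fixed outer discount gamma_outer and is the
# one used for advantage estimation in the outer (meta) objective.
import torch
import torch.nn as nn


class TwoHeadCritic(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.inner_head = nn.Linear(hidden, 1)  # V_inner(s), regressed with gamma_inner
        self.outer_head = nn.Linear(hidden, 1)  # V_outer(s), regressed with gamma_outer

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        return self.inner_head(h).squeeze(-1), self.outer_head(h).squeeze(-1)


def td_targets(rewards, next_values, dones, gamma):
    # One-step bootstrapped targets: r_t + gamma * (1 - done_t) * V(s_{t+1}).
    return rewards + gamma * (1.0 - dones) * next_values


def critic_loss(critic, obs, next_obs, rewards, dones, gamma_inner, gamma_outer):
    # Train both heads alongside each other, each against its own discount factor.
    v_in, v_out = critic(obs)
    with torch.no_grad():
        v_in_next, v_out_next = critic(next_obs)
    target_in = td_targets(rewards, v_in_next, dones, gamma_inner)
    target_out = td_targets(rewards, v_out_next, dones, gamma_outer)
    return ((v_in - target_in) ** 2).mean() + ((v_out - target_out) ** 2).mean()


def outer_advantage(critic, obs, next_obs, rewards, dones, gamma_outer):
    # Advantages for the outer (meta) objective are computed from the OUTER head,
    # removing the bias that comes from bootstrapping with the inner critic.
    with torch.no_grad():
        _, v_out = critic(obs)
        _, v_out_next = critic(next_obs)
    return td_targets(rewards, v_out_next, dones, gamma_outer) - v_out
```

In a meta-gradient update, the advantages returned by outer_advantage would replace the inner-critic advantages in the outer policy-gradient loss, while gamma_inner remains the hyper-parameter that the inner loop continues to self-tune.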
Related papers
- Marginal Debiased Network for Fair Visual Recognition [59.05212866862219]
We propose a novel marginal debiased network (MDN) to learn debiased representations.
Our MDN achieves remarkable performance on under-represented samples.
arXiv Detail & Related papers (2024-01-04T08:57:09Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Train Hard, Fight Easy: Robust Meta Reinforcement Learning [78.16589993684698]
A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients.
Standard meta-RL (MRL) methods optimize the average return over tasks, but often suffer from poor results on tasks of high risk or difficulty.
In this work, we define a robust MRL objective with a controlled level of robustness.
The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML).
arXiv Detail & Related papers (2023-01-26T14:54:39Z) - An Investigation of the Bias-Variance Tradeoff in Meta-Gradients [53.28925387487846]
Hessian estimation always adds bias and can also add variance to meta-gradient estimation.
We study the bias and variance tradeoff arising from truncated backpropagation and sampling correction.
arXiv Detail & Related papers (2022-09-22T20:33:05Z) - Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample-efficient technique to obtain control policies.
VaGraM is a novel method for value-aware model learning.
arXiv Detail & Related papers (2022-04-04T13:28:31Z) - A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning [16.824515577815696]
Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures.
We show that existing meta-gradient estimators adopted by GMRL are actually biased.
We conduct experiments on the Iterated Prisoner's Dilemma and Atari games to show how other methods, such as off-policy learning and low-bias estimators, can help fix the gradient bias for GMRL algorithms in general.
arXiv Detail & Related papers (2021-12-31T11:56:40Z) - A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The goal of Inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit enhanced performance (and similar tractability) compared to off-the-shelf ones in multiple OpenAI Gym environments.
arXiv Detail & Related papers (2021-05-25T10:30:45Z) - DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.