Unbiased Methods for Multi-Goal Reinforcement Learning
- URL: http://arxiv.org/abs/2106.08863v1
- Date: Wed, 16 Jun 2021 15:31:51 GMT
- Title: Unbiased Methods for Multi-Goal Reinforcement Learning
- Authors: Léonard Blier and Yann Ollivier
- Abstract summary: In multi-goal reinforcement learning, the reward for each goal is sparse, and located in a small neighborhood of the goal.
We show that Hindsight Experience Replay (HER) can converge to low-return policies by overestimating chancy outcomes.
We introduce unbiased deep Q-learning and actor-critic algorithms that can handle such infinitely sparse rewards, and test them in toy environments.
- Score: 13.807859854345834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multi-goal reinforcement learning (RL) settings, the reward for each goal
is sparse, and located in a small neighborhood of the goal. In large dimension,
the probability of reaching a reward vanishes and the agent receives little
learning signal. Methods such as Hindsight Experience Replay (HER) tackle this
issue by also learning from realized but unplanned-for goals. But HER is known
to introduce bias, and can converge to low-return policies by overestimating
chancy outcomes. First, we vindicate HER by proving that it is actually
unbiased in deterministic environments, such as many optimal control settings.
Next, for stochastic environments in continuous spaces, we tackle sparse
rewards by directly taking the infinitely sparse reward limit. We fully
formalize the problem of multi-goal RL with infinitely sparse Dirac rewards at
each goal. We introduce unbiased deep Q-learning and actor-critic algorithms
that can handle such infinitely sparse rewards, and test them in toy
environments.
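As a concrete illustration of the hindsight mechanism discussed above, the sketch below relabels a finished episode with the goal it actually reached and recomputes the sparse reward. It is a minimal, hypothetical example (the episode layout, the way the achieved goal is extracted, and the `reward_eps` threshold are assumptions), not the authors' implementation.

```python
import numpy as np

def her_relabel(episode, reward_eps=0.05):
    """Hindsight relabeling ("final" strategy): replay an episode as if the
    goal actually reached at the end had been the intended goal all along.

    `episode` is a list of (state, action, next_state, goal) tuples, with
    states and goals living in the same space (an illustrative assumption).
    """
    achieved = episode[-1][2]  # outcome realized at the end of the trajectory
    relabeled = []
    for state, action, next_state, _ in episode:
        # Sparse reward: nonzero only in a small neighborhood of the goal.
        dist = np.linalg.norm(np.asarray(next_state) - np.asarray(achieved))
        reward = float(dist < reward_eps)
        relabeled.append((state, action, next_state, achieved, reward))
    return relabeled

# Toy usage: a two-step episode in R^2 whose original goal was never reached.
ep = [(np.zeros(2), 0, np.array([0.3, 0.0]), np.ones(2)),
      (np.array([0.3, 0.0]), 1, np.array([0.5, 0.1]), np.ones(2))]
print(her_relabel(ep))
```

Because the relabeled goal is itself a chancy outcome of the trajectory, training on such transitions in a stochastic environment can overestimate how controllable that outcome was; with deterministic dynamics, as the paper argues, the resulting updates remain unbiased.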
Related papers
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- $f$-Policy Gradients: A General Framework for Goal Conditioned RL using $f$-Divergences [44.91973620442546]
This paper introduces a novel way to encourage exploration called $f$-Policy Gradients.
We show that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld.
arXiv Detail & Related papers (2023-10-10T17:07:05Z)
- Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards [101.7246658985579]
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data.
We propose embracing the heterogeneity of diverse rewards by following a multi-policy strategy.
We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks.
arXiv Detail & Related papers (2023-06-07T14:58:15Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- USHER: Unbiased Sampling for Hindsight Experience Replay [12.660090786323067]
Dealing with sparse rewards is a long-standing challenge in reinforcement learning (RL).
Hindsight Experience Replay (HER) addresses this problem by reusing failed trajectories for one goal as successful trajectories for another.
This strategy is known to result in a biased value function, as the update rule underestimates the likelihood of bad outcomes in a stochastic environment.
We propose an asymptotically unbiased importance-sampling-based algorithm to address this problem without sacrificing performance on deterministic environments (a rough sketch of the importance-weighting idea appears after this list).
arXiv Detail & Related papers (2022-07-03T20:25:06Z)
- Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL [91.26538493552817]
We present a formulation of hindsight relabeling for meta-RL, which relabels experience during meta-training to enable learning to learn entirely using sparse reward.
We demonstrate the effectiveness of our approach on a suite of challenging sparse reward goal-reaching environments.
arXiv Detail & Related papers (2021-12-02T00:51:17Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- Learning One Representation to Optimize All Rewards [19.636676744015197]
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
arXiv Detail & Related papers (2021-03-14T15:00:08Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Learning Intrinsic Symbolic Rewards in Reinforcement Learning [7.101885582663675]
We present a method that discovers dense rewards in the form of low-dimensional symbolic trees.
We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks.
arXiv Detail & Related papers (2020-10-08T00:02:46Z)
- Reinforcement Learning with Goal-Distance Gradient [1.370633147306388]
Reinforcement learning usually uses feedback rewards from the environment to train agents.
Most current methods struggle to achieve good performance in sparse-reward or reward-free environments.
We propose a model-free method that does not rely on environmental rewards to address the sparse-reward problem in general environments.
arXiv Detail & Related papers (2020-01-01T02:37:34Z)
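The following is a rough sketch of the importance-weighting idea referenced in the USHER entry above: a relabeled transition's TD update is reweighted by the ratio between the probability of that goal under the task's goal distribution and under the hindsight distribution. The tabular setting, the probability inputs, and all constants are illustrative assumptions rather than the algorithm from that paper.

```python
import numpy as np

def weighted_q_update(q, s, a, g, reward, s_next, done,
                      p_intended, p_hindsight, lr=0.1, gamma=0.99):
    """One tabular Q-learning step on a relabeled transition, reweighted by an
    importance ratio between the task's goal distribution and the hindsight
    (achieved-goal) distribution. Both probabilities are placeholder inputs.
    """
    w = p_intended / max(p_hindsight, 1e-8)              # importance weight
    target = reward + gamma * (1.0 - done) * q[s_next, :, g].max()
    q[s, a, g] += lr * w * (target - q[s, a, g])
    return q

# Toy usage: 4 states, 2 actions, 3 goals.
q = np.zeros((4, 2, 3))
q = weighted_q_update(q, s=0, a=1, g=2, reward=1.0, s_next=3, done=1.0,
                      p_intended=0.2, p_hindsight=0.5)
```

Down-weighting goals that were disproportionately likely to be reached by chance is the kind of correction that can counteract the overestimation of chancy outcomes described in the abstract above.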