Reward Uncertainty for Exploration in Preference-based Reinforcement Learning
- URL: http://arxiv.org/abs/2205.12401v1
- Date: Tue, 24 May 2022 23:22:10 GMT
- Title: Reward Uncertainty for Exploration in Preference-based Reinforcement Learning
- Authors: Xinran Liang, Katherine Shu, Kimin Lee, Pieter Abbeel
- Abstract summary: We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
- Score: 88.34958680436552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conveying complex objectives to reinforcement learning (RL) agents often
requires meticulous reward engineering. Preference-based RL methods are able to
learn a more flexible reward model based on human preferences by actively
incorporating human feedback, i.e., a teacher's preferences between two clips of
behavior. However, poor feedback-efficiency remains a problem in current
preference-based RL algorithms, as tailored human feedback is very expensive.
To handle this issue, previous methods have mainly focused on improving query
selection and policy initialization. At the same time, recent exploration
methods have proven to be a recipe for improving sample-efficiency in RL. We
present an exploration method specifically for preference-based RL algorithms.
Our main idea is to design an intrinsic reward by measuring novelty based on the
learned reward. Specifically, we utilize disagreement across an ensemble of
learned reward models. Our intuition is that disagreement among the learned
reward models reflects uncertainty in the tailored human feedback and could be
useful for exploration. Our experiments show that an exploration bonus derived
from uncertainty in the learned reward improves both the feedback- and
sample-efficiency of preference-based RL algorithms on complex robot
manipulation tasks from the Meta-World benchmark, compared with existing
exploration methods that measure the novelty of state visitation.
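The exploration bonus described in the abstract can be summarized in a short sketch. The following is not the authors' implementation; the ensemble size, network widths, helper names, and the scale factor `beta` are illustrative assumptions. The intrinsic reward is the disagreement (standard deviation) across an ensemble of learned reward models evaluated at a state-action pair.

```python
import torch
import torch.nn as nn


def make_reward_model(obs_dim: int, act_dim: int, hidden: int = 256) -> nn.Module:
    """One member of the reward ensemble: maps (state, action) to a scalar reward."""
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )


class RewardEnsemble(nn.Module):
    """An ensemble of independently initialized learned reward models."""

    def __init__(self, obs_dim: int, act_dim: int, n_models: int = 3):
        super().__init__()
        self.members = nn.ModuleList(
            [make_reward_model(obs_dim, act_dim) for _ in range(n_models)]
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, act], dim=-1)
        # Stack member predictions: shape (n_models, batch, 1).
        return torch.stack([m(x) for m in self.members], dim=0)


def intrinsic_reward(ensemble: RewardEnsemble,
                     obs: torch.Tensor,
                     act: torch.Tensor,
                     beta: float = 0.05) -> torch.Tensor:
    """Exploration bonus: beta times the std of the ensemble's reward predictions."""
    with torch.no_grad():
        preds = ensemble(obs, act)           # (n_models, batch, 1)
        disagreement = preds.std(dim=0)      # (batch, 1)
    return beta * disagreement.squeeze(-1)   # (batch,)
```

In practice this bonus would be added to the ensemble's mean reward when training the policy, with `beta` typically annealed as more human feedback accumulates; the exact weighting and schedule here are assumptions rather than values from the paper.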
Related papers
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z) - Data Driven Reward Initialization for Preference based Reinforcement
Learning [20.13307800821161]
Preference-based Reinforcement Learning (PbRL) methods utilize binary feedback from the human in the loop (HiL) over queried trajectory pairs to learn a reward model (a minimal sketch of this objective appears after this list).
We investigate the high degree of variability in the learned reward models, which are sensitive to the random seed of the experiment.
arXiv Detail & Related papers (2023-02-17T07:07:07Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMControl and Meta-World.
It shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We propose Information Directed Reward Learning (IDRL), which learns a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z) - Self-Imitation Advantage Learning [43.8107780378031]
Self-imitation learning is a reinforcement learning method that encourages actions whose returns were higher than expected.
We propose a novel generalization of self-imitation learning for off-policy RL, based on a modification of the Bellman optimality operator.
arXiv Detail & Related papers (2020-12-22T13:21:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.