Inverse Preference Learning: Preference-based RL without a Reward Function
- URL: http://arxiv.org/abs/2305.15363v2
- Date: Fri, 24 Nov 2023 22:12:33 GMT
- Title: Inverse Preference Learning: Preference-based RL without a Reward Function
- Authors: Joey Hejna, Dorsa Sadigh
- Abstract summary: Inverse Preference Learning (IPL) is specifically designed for learning from offline preference data.
Our key insight is that for a fixed policy, the $Q$-function encodes all information about the reward function, effectively making them interchangeable.
IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions.
- Score: 34.31087304327075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward functions are difficult to design and often hard to align with human
intent. Preference-based Reinforcement Learning (RL) algorithms address these
problems by learning reward functions from human feedback. However, the
majority of preference-based RL methods naïvely combine supervised reward
models with off-the-shelf RL algorithms. Contemporary approaches have sought to
improve performance and query complexity by using larger and more complex
reward architectures such as transformers. Instead of using highly complex
architectures, we develop a new and parameter-efficient algorithm, Inverse
Preference Learning (IPL), specifically designed for learning from offline
preference data. Our key insight is that for a fixed policy, the $Q$-function
encodes all information about the reward function, effectively making them
interchangeable. Using this insight, we completely eliminate the need for a
learned reward function. Our resulting algorithm is simpler and more
parameter-efficient. Across a suite of continuous control and robotics
benchmarks, IPL attains competitive performance compared to more complex
approaches that leverage transformer-based and non-Markovian reward functions
while having fewer algorithmic hyperparameters and learned network parameters.
Our code is publicly released.
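To make the stated insight concrete, here is a minimal, hedged sketch in PyTorch (placeholder names such as q_net, v_of, and the segment dictionaries are assumptions; this is not the authors' released implementation). For a fixed policy the Bellman equation gives r(s, a) = Q(s, a) - γ V(s'), so the standard Bradley-Terry preference loss can be written directly on the Q-function, removing the separate reward network:

```python
import torch
import torch.nn.functional as F

def implied_reward(q_net, v_of, obs, act, next_obs, gamma=0.99):
    """Reward implied by a fixed policy's Q-function: r = Q(s, a) - gamma * V(s')."""
    # v_of(next_obs) stands in for the next-state value under the fixed policy,
    # e.g. an expectation of q_net over the policy's actions.
    return q_net(obs, act) - gamma * v_of(next_obs)

def preference_loss(q_net, v_of, seg_1, seg_2, labels, gamma=0.99):
    """Bradley-Terry preference loss written directly on the Q-function.

    seg_1 / seg_2: dicts of tensors of shape (batch, T, ...) for two behavior
    segments; labels[i] = 1.0 if segment 1 was preferred in pair i, else 0.0.
    """
    r1 = implied_reward(q_net, v_of, seg_1["obs"], seg_1["act"], seg_1["next_obs"], gamma)
    r2 = implied_reward(q_net, v_of, seg_2["obs"], seg_2["act"], seg_2["next_obs"], gamma)
    # Segment "return" under the implied reward, then a logistic comparison.
    logits = r1.sum(dim=-1) - r2.sum(dim=-1)
    return F.binary_cross_entropy_with_logits(logits, labels)
```

In the full algorithm the Q-function must also satisfy a Bellman-style consistency constraint under the fixed policy; the snippet only illustrates the substitution that makes an explicit reward model unnecessary.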
Related papers
- Few-shot In-Context Preference Learning Using Large Language Models [15.84585737510038]
Designing reward functions is a core component of reinforcement learning.
Learning reward functions can be exceedingly inefficient, as they are often learned tabula rasa.
We propose In-Context Preference Learning (ICPL) to accelerate learning reward functions from preferences.
arXiv Detail & Related papers (2024-10-22T17:53:34Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
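The summary names reward regularization but not its specific form; purely as a loosely grounded illustration, the sketch below adds a generic penalty term to a Bradley-Terry reward-learning loss (all names, including the choice of regularizer passed in as reg_fn, are assumptions rather than the paper's method):

```python
import torch
import torch.nn.functional as F

def regularized_reward_loss(reward_net, seg_pos, seg_neg, reg_fn, lam=0.1):
    """Preference (Bradley-Terry) loss plus a generic regularizer on the reward model.

    seg_pos / seg_neg: dicts of (batch, T, ...) tensors, with seg_pos preferred;
    reg_fn: any penalty on reward_net intended to discourage overoptimization
    (its concrete form is an assumption here, not taken from the paper).
    """
    r_pos = reward_net(seg_pos["obs"], seg_pos["act"]).sum(dim=-1)
    r_neg = reward_net(seg_neg["obs"], seg_neg["act"]).sum(dim=-1)
    logits = r_pos - r_neg
    bt_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return bt_loss + lam * reg_fn(reward_net)
```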
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
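The contrastive objective can be sketched as follows (a hedged illustration with assumed names, not the authors' code): under maximum-entropy RL the optimal advantage is proportional to the policy's log-probability, so a Bradley-Terry comparison of two segments reduces to a contrastive loss on log π alone, with no reward model or value function:

```python
import torch
import torch.nn.functional as F

def cpl_style_loss(policy_log_prob, seg_pos, seg_neg, alpha=0.1):
    """Contrastive loss on policy log-probabilities of two compared segments.

    policy_log_prob(obs, act) -> per-step log pi(a|s), shape (batch, T);
    seg_pos is the preferred segment of each pair. All names are placeholders.
    """
    score_pos = alpha * policy_log_prob(seg_pos["obs"], seg_pos["act"]).sum(dim=-1)
    score_neg = alpha * policy_log_prob(seg_neg["obs"], seg_neg["act"]).sum(dim=-1)
    # Bradley-Terry: the preferred segment should receive the higher score.
    logits = score_pos - score_neg
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```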
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback-efficiency and sample-efficiency of preference-based RL algorithms.
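As a hedged sketch of this idea (placeholder names; the ensemble form of the uncertainty estimate is an assumption, not stated in the summary): train several reward models on the same preference data and add their disagreement to the learned reward as an exploration bonus:

```python
import torch

def exploration_bonus(reward_ensemble, obs, act, beta=0.05):
    """Intrinsic reward = beta * disagreement (std) across an ensemble of learned rewards."""
    preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)  # (n_models, batch)
    return beta * preds.std(dim=0)

def shaped_reward(reward_ensemble, obs, act, beta=0.05):
    """Mean learned reward plus the uncertainty-based exploration bonus."""
    preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)
    return preds.mean(dim=0) + beta * preds.std(dim=0)
```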
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
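A simulated teacher of this kind can be sketched as follows (parameter names and the exact response model are assumptions, not B-Pref's interface): preferences are sampled from a Boltzmann model of the two segments' true returns, with an "equally preferred" response for close returns and occasional random labeling errors:

```python
import numpy as np

def simulated_teacher(return_1, return_2, beta=1.0, error_prob=0.1,
                      equal_margin=0.0, rng=None):
    """Return 1 if segment 1 is preferred, 0 if segment 2, 0.5 if judged equal."""
    if rng is None:
        rng = np.random.default_rng()
    if abs(return_1 - return_2) <= equal_margin:
        return 0.5  # teacher declines to pick a winner
    # Boltzmann-rational choice: larger beta means a more rational teacher.
    p_prefer_1 = 1.0 / (1.0 + np.exp(-beta * (return_1 - return_2)))
    label = 1 if rng.random() < p_prefer_1 else 0
    # Occasionally the teacher flips its answer by mistake.
    if rng.random() < error_prob:
        label = 1 - label
    return label
```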
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls needed to solve for a policy.
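As a hedged, high-level sketch of the lazy-update idea (all components are placeholder callables; the actual sensitivity-based sub-sampling analysis is more involved): score each new transition against a sub-sampled buffer and re-solve for a policy only when the accumulated information gain crosses a threshold, so optimization calls grow far more slowly than the number of episodes K:

```python
def train_with_subsampling(rollout, estimate_info_gain, optimize_policy,
                           initial_policy, num_episodes, threshold=1.0):
    """Lazy policy updates driven by accumulated information gain (sketch only)."""
    policy, buffer, accumulated_gain = initial_policy, [], 0.0
    for _ in range(num_episodes):
        for transition in rollout(policy):              # collect one episode
            gain = estimate_info_gain(transition, buffer)
            if gain > 0.0:                              # keep only informative points
                buffer.append(transition)
                accumulated_gain += gain
        if accumulated_gain >= threshold:               # update the policy only occasionally
            policy = optimize_policy(buffer)
            accumulated_gain = 0.0
    return policy
```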
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
- Learning Dexterous Manipulation from Suboptimal Experts [69.8017067648129]
Relative Entropy Q-Learning (REQ) is a simple policy algorithm that combines ideas from successful offline and conventional RL algorithms.
We show how REQ is also effective for general off-policy RL, offline RL, and RL from demonstrations.
arXiv Detail & Related papers (2020-10-16T18:48:49Z)
- Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples [31.31937554018045]
Deep reinforcement learning (RL) methods require intensive data from the exploration of the environment to achieve satisfactory performance.
We propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations.
Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm.
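A hedged sketch of that loop (all components are placeholders; the paper's actual use of L* with membership and equivalence queries is richer than this): maintain a hypothesis reward automaton, train the RL agent against it, and refine the hypothesis whenever an observed trace contradicts its predicted reward:

```python
def learn_reward_automaton(rl_agent, automaton_learner, env, num_iterations=10):
    """Alternate RL training with counterexample-driven automaton refinement (sketch)."""
    hypothesis = automaton_learner.initial_hypothesis()
    for _ in range(num_iterations):
        # Train the agent with rewards produced by the current hypothesis automaton.
        traces = rl_agent.train(env, reward_machine=hypothesis)
        # A trace whose observed reward disagrees with the hypothesis is a counterexample.
        counterexample = next(
            (t for t in traces if hypothesis.reward_of(t.labels) != t.reward), None
        )
        if counterexample is None:
            break  # hypothesis consistent with everything observed so far
        hypothesis = automaton_learner.refine(counterexample)  # e.g., an L*-style update
    return hypothesis, rl_agent
```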
arXiv Detail & Related papers (2020-06-28T21:13:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.