Direct Preference-based Policy Optimization without Reward Modeling
- URL: http://arxiv.org/abs/2301.12842v3
- Date: Fri, 27 Oct 2023 08:14:48 GMT
- Title: Direct Preference-based Policy Optimization without Reward Modeling
- Authors: Gaon An, Junhyeok Lee, Xingdong Zuo, Norio Kosaka, Kyung-Min Kim, Hyun Oh Song
- Abstract summary: Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to learn from preference.
We propose a PbRL algorithm that directly learns from preference without requiring any reward modeling.
We show that our algorithm surpasses offline RL methods that learn with ground-truth reward information.
- Score: 25.230992130108767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference-based reinforcement learning (PbRL) is an approach that enables RL
agents to learn from preference, which is particularly useful when formulating
a reward function is challenging. Existing PbRL methods generally involve a
two-step procedure: they first learn a reward model based on given preference
data and then employ off-the-shelf reinforcement learning algorithms using the
learned reward model. However, obtaining an accurate reward model solely from
preference information, especially when the preference is from human teachers,
can be difficult. Instead, we propose a PbRL algorithm that directly learns
from preference without requiring any reward modeling. To achieve this, we
adopt a contrastive learning framework to design a novel policy scoring metric
that assigns a high score to policies that align with the given preferences. We
apply our algorithm to offline RL tasks with actual human preference labels and
show that our algorithm outperforms or is on par with the existing PbRL
methods. Notably, on high-dimensional control tasks, our algorithm surpasses
offline RL methods that learn with ground-truth reward information. Finally, we
show that our algorithm can be successfully applied to fine-tune large language
models.
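The contrastive idea in the abstract, a score that is high for policies that align with the given preferences, can be illustrated with a small sketch. This is not the paper's actual metric: the function names `segment_distance` and `contrastive_score`, the Euclidean action distance, and the toy data are all assumptions made for illustration only.

```python
import numpy as np

def segment_distance(policy, states, actions):
    """Mean Euclidean distance between the policy's actions and a segment's actions.
    (Assumed distance measure; the paper may define this differently.)"""
    predicted = np.stack([policy(s) for s in states])
    return float(np.mean(np.linalg.norm(predicted - actions, axis=1)))

def contrastive_score(policy, preferred, rejected):
    """Score in (0, 1): higher when the policy is closer to the preferred
    segment than to the rejected one (two-way softmax over negative distances)."""
    d_pref = segment_distance(policy, *preferred)
    d_rej = segment_distance(policy, *rejected)
    logits = np.array([-d_pref, -d_rej])
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return float(weights[0] / weights.sum())

# Hypothetical toy data: 5 steps, 3-dim states, 2-dim actions.
states = np.zeros((5, 3))
preferred = (states, np.ones((5, 2)))    # segment labeled as preferred
rejected = (states, -np.ones((5, 2)))    # segment labeled as not preferred

def policy_good(state):
    return np.ones(2)    # imitates the preferred segment

def policy_bad(state):
    return -np.ones(2)   # imitates the rejected segment

score_good = contrastive_score(policy_good, preferred, rejected)
score_bad = contrastive_score(policy_bad, preferred, rejected)
print(score_good > score_bad)
```

A policy could then be trained by maximizing such a score over a dataset of labeled segment pairs, with no reward model in between; how the paper actually regularizes and optimizes its metric is not stated in the abstract.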
Related papers
- Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- Preference-based Reinforcement Learning with Finite-Time Guarantees [76.88632321436472]
Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning to better elicit human opinion on the target objective.
Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy.
We present the first finite-time analysis for general PbRL problems.
arXiv Detail & Related papers (2020-06-16T03:52:41Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.