Preference-based Reinforcement Learning with Finite-Time Guarantees
- URL: http://arxiv.org/abs/2006.08910v2
- Date: Fri, 23 Oct 2020 20:24:58 GMT
- Title: Preference-based Reinforcement Learning with Finite-Time Guarantees
- Authors: Yichong Xu, Ruosong Wang, Lin F. Yang, Aarti Singh and Artur Dubrawski
- Abstract summary: Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning with preferences to better elicit human opinion on the target objective.
Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy.
We present the first finite-time analysis for general PbRL problems.
- Score: 76.88632321436472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based Reinforcement Learning (PbRL) replaces reward values in
traditional reinforcement learning by preferences to better elicit human
opinion on the target objective, especially when numerical reward values are
hard to design or interpret. Despite promising results in applications, the
theoretical understanding of PbRL is still in its infancy. In this paper, we
present the first finite-time analysis for general PbRL problems. We first show
that, when preferences over trajectories are deterministic, a unique optimal
policy for PbRL may not exist. If preferences are stochastic, and the preference
probability relates to the hidden reward values, we present algorithms for
PbRL, both with and without a simulator, that are able to identify the best
policy up to accuracy $\varepsilon$ with high probability. Our method explores
the state space by navigating to under-explored states, and solves PbRL using a
combination of dueling bandits and policy search. Experiments show the efficacy
of our method when it is applied to real-world problems.
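The modeling assumption in the abstract (stochastic preferences whose probabilities are tied to hidden reward values) and the dueling-bandit flavor of the comparisons can be illustrated with a minimal sketch. The logistic (Bradley-Terry-style) link, the `rollout` and `reward_fn` helpers, and the stopping rule below are assumptions made for illustration, not the paper's exact construction; the learner only ever observes the binary duel outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)


def trajectory_return(trajectory, reward_fn):
    """Hidden cumulative reward of a trajectory given as (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)


def preference_oracle(traj_a, traj_b, reward_fn):
    """Stochastic preference feedback: P(traj_a preferred over traj_b) is a
    logistic (Bradley-Terry-style) function of the hidden return gap.
    The logistic link is an illustrative assumption, not the paper's model."""
    gap = trajectory_return(traj_a, reward_fn) - trajectory_return(traj_b, reward_fn)
    p_a = 1.0 / (1.0 + np.exp(-gap))
    return rng.random() < p_a  # True means traj_a wins the duel


def duel_policies(policy_a, policy_b, rollout, reward_fn, eps=0.05, delta=0.05):
    """Pick the apparently better of two policies from repeated duels, in the
    spirit of a dueling-bandit comparison: stop once a Hoeffding-style
    confidence radius separates the empirical win rate from 1/2, or once the
    radius shrinks below the target accuracy eps."""
    wins, n = 0, 0
    while True:
        n += 1
        # reward_fn is only passed to the simulated oracle; the learner itself
        # sees nothing but the binary outcome of each duel.
        if preference_oracle(rollout(policy_a), rollout(policy_b), reward_fn):
            wins += 1
        p_hat = wins / n
        radius = np.sqrt(np.log(2.0 * n * (n + 1) / delta) / (2.0 * n))
        if p_hat - radius > 0.5:
            return policy_a
        if p_hat + radius < 0.5:
            return policy_b
        if radius < eps:  # too close to call at accuracy eps
            return policy_a if p_hat >= 0.5 else policy_b
```

In the paper itself, such pairwise comparisons are combined with policy search and with exploration that steers the agent toward under-explored states; the sketch only captures the comparison primitive under the assumed preference model.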
Related papers
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO)
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences [23.414135977983953]
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal.
We present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences.
arXiv Detail & Related papers (2024-02-27T07:03:25Z)
- Is RLHF More Difficult than Standard RL? [31.972393805014903]
Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals.
This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs.
arXiv Detail & Related papers (2023-06-25T03:18:15Z)
- Direct Preference-based Policy Optimization without Reward Modeling [25.230992130108767]
Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to learn from preference.
We propose a PbRL algorithm that directly learns from preference without requiring any reward modeling.
We show that our algorithm surpasses offline RL methods that learn with ground-truth reward information.
arXiv Detail & Related papers (2023-01-30T12:51:13Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
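The last entry above (PEVI) describes value iteration with an uncertainty quantifier used as a penalty. A minimal tabular sketch follows, assuming count, reward, and transition estimates built from the offline dataset; the count-based penalty and all variable names are illustrative choices rather than that paper's exact quantifier.

```python
import numpy as np


def pessimistic_value_iteration(counts, rewards, transitions,
                                gamma=0.99, beta=1.0, iters=100):
    """Tabular sketch of value iteration with a pessimistic penalty.

    counts[s, a]          -- visitation counts in the offline dataset
    rewards[s, a]         -- estimated mean rewards
    transitions[s, a, :]  -- estimated next-state distributions
    The penalty beta / sqrt(max(count, 1)) stands in for the paper's
    uncertainty quantifier; it is an assumed, illustrative choice.
    """
    penalty = beta / np.sqrt(np.maximum(counts, 1.0))
    num_states, _ = rewards.shape
    V = np.zeros(num_states)
    Q = np.zeros_like(rewards)
    for _ in range(iters):
        # Pessimism: subtract the uncertainty penalty from the Bellman backup,
        # so poorly covered state-action pairs look less attractive.
        Q = rewards + gamma * (transitions @ V) - penalty
        Q = np.maximum(Q, 0.0)  # simple truncation at zero
        V = Q.max(axis=1)
    greedy_policy = Q.argmax(axis=1)
    return Q, V, greedy_policy
```

The penalty discourages the greedy policy from exploiting state-action pairs that the offline dataset covers poorly, which is the intuition behind pessimism in offline RL.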
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.