Preference Elicitation for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2406.18450v1
- Date: Wed, 26 Jun 2024 15:59:13 GMT
- Title: Preference Elicitation for Offline Reinforcement Learning
- Authors: Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
- Abstract summary: We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm.
Our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy.
- Score: 59.136381500967744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.
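To make the elicitation loop concrete, below is a minimal sketch of the query-selection idea the abstract describes, not the paper's implementation: rollouts are generated in a learned model, penalized where the model is uncertain (pessimism), and the pair of rollouts whose preference a reward ensemble disagrees on most is queried (optimism). All shapes, the linear reward ensemble, and the uncertainty proxy are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
n_rollouts, horizon, feat_dim, n_models = 20, 15, 4, 5

# Simulated rollouts from a learned dynamics model (feature vectors per step) and a
# per-step model-uncertainty proxy, e.g. disagreement of a dynamics ensemble.
rollouts = rng.normal(size=(n_rollouts, horizon, feat_dim))
model_uncertainty = rng.uniform(size=(n_rollouts, horizon))

# Stand-in reward ensemble: each member scores a step's features linearly.
reward_ensemble = rng.normal(size=(n_models, feat_dim))
beta = 1.0  # pessimism coefficient

# Pessimistic return of each rollout under each ensemble member.
returns = np.einsum("md,nhd->mn", reward_ensemble, rollouts)
returns -= beta * model_uncertainty.sum(axis=-1)

def preference_disagreement(i, j):
    """Spread across ensemble members of P(rollout i preferred to rollout j)."""
    p = 1.0 / (1.0 + np.exp(-(returns[:, i] - returns[:, j])))
    return p.std()

pairs = [(i, j) for i in range(n_rollouts) for j in range(i + 1, n_rollouts)]
best_pair = max(pairs, key=lambda ij: preference_disagreement(*ij))
print("most informative simulated pair to query:", best_pair)
```
In the actual algorithm the learned model, the reward model class and the acquisition rule come with the sample-complexity guarantees stated above; the snippet only illustrates where pessimism and optimism enter.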
Related papers
- Hindsight Preference Learning for Offline Preference-based Reinforcement Learning [22.870967604847458] (arXiv, 2024-07-05)
Offline preference-based reinforcement learning (RL) focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset.
We propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments.
Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets.
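As a hedged illustration of the entry above (not HPL's code), the snippet below conditions a reward network on a summary of a segment's future outcome and trains it with a standard Bradley-Terry preference loss; the network sizes, the future-summary vectors and the helper names are assumptions.
```python
import torch
import torch.nn as nn

state_dim, action_dim, future_dim = 8, 2, 16

# Reward model that also sees a summary of what happened *after* the segment.
reward_net = nn.Sequential(
    nn.Linear(state_dim + action_dim + future_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)

def segment_return(states, actions, future_summary):
    """Sum of hindsight-conditioned rewards over one segment of length T."""
    T = states.shape[0]
    inp = torch.cat([states, actions, future_summary.expand(T, -1)], dim=-1)
    return reward_net(inp).sum()

# One preference pair: segment A preferred over segment B (label = 1).
T = 10
sA, aA, zA = torch.randn(T, state_dim), torch.randn(T, action_dim), torch.randn(future_dim)
sB, aB, zB = torch.randn(T, state_dim), torch.randn(T, action_dim), torch.randn(future_dim)

logits = segment_return(sA, aA, zA) - segment_return(sB, aB, zB)
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.tensor(1.0))
loss.backward()  # gradients flow into reward_net
```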
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999] (arXiv, 2024-05-29)
We introduce value-incentivized preference optimization (VPO), a unified approach to online and offline RLHF.
VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function.
Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
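A hedged sketch of the objective described above, in a simplified bandit-style setting (not VPO's implementation): the reward is fit by maximum likelihood on preference pairs, and the estimate is regularized by the value the current reward assigns to its greedy choice over a candidate set. The candidate set, the network and the coefficient lam are assumptions; the paper's online and offline variants differ in the sign of the regularizer.
```python
import torch
import torch.nn as nn

feat_dim, n_candidates, lam = 6, 32, 0.1
reward_net = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

preferred = torch.randn(16, feat_dim)       # features of preferred items
rejected = torch.randn(16, feat_dim)        # features of rejected items
candidates = torch.randn(n_candidates, feat_dim)

# Bradley-Terry negative log-likelihood of the observed preferences (the MLE term).
logits = (reward_net(preferred) - reward_net(rejected)).squeeze(-1)
nll = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones(16))

# Value of the current reward model: greedy value over the candidate set.
value = reward_net(candidates).max()

loss = nll - lam * value   # optimistic (online) variant; pessimism flips the sign
loss.backward()
```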
- Learning Goal-Conditioned Policies from Sub-Optimal Offline Data via Metric Learning [22.174803826742963] (arXiv, 2024-02-16)
We address the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning.
We propose the use of metric learning to approximate the optimal value function for goal-conditioned offline RL problems.
We show that our method estimates optimal behaviors from severely sub-optimal offline datasets without suffering from out-of-distribution estimation errors.
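The general recipe named in this entry can be illustrated with a small hedged sketch (not the authors' architecture or training objective): states and goals are embedded by a learned map phi, and the goal-conditioned value is read off as a negative distance in embedding space.
```python
import torch
import torch.nn as nn

state_dim, embed_dim = 8, 16   # goals are assumed to live in state space here
phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

def value(states, goals):
    """Goal-conditioned value as negative embedding distance, V(s, g) ~ -||phi(s) - phi(g)||."""
    return -torch.linalg.vector_norm(phi(states) - phi(goals), dim=-1)

# Rank candidate actions by the successor state each would reach (successor states
# would come from the offline dataset or a learned model, omitted here).
successor_states = torch.randn(5, state_dim)
goal = torch.randn(1, state_dim)
best_candidate = value(successor_states, goal.expand(5, -1)).argmax()
print("candidate ranked best under the metric value:", int(best_candidate))
```
How phi is trained so that this distance tracks the optimal value, and how out-of-distribution errors are avoided, is the paper's contribution and is not reproduced here.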
arXiv Detail & Related papers (2024-02-16T16:46:53Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Beyond Reward: Offline Preference-guided Policy Optimization [18.49648170835782]
Offline preference-based reinforcement learning (PbRL) is a variant of conventional reinforcement learning that dispenses with the need for online interaction.
This study focuses on offline preference-guided policy optimization (OPPO).
OPPO models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function.
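To illustrate the "no separate reward function" idea in this entry, here is a rough sketch of the general recipe, not OPPO's actual architecture: whole trajectories are embedded, a preference score is learned directly on the embeddings, and the policy is conditioned on a high-scoring embedding. The encoder, scorer, policy and all shapes are assumptions.
```python
import torch
import torch.nn as nn

traj_dim, z_dim, state_dim, action_dim = 32, 8, 8, 2
encoder = nn.Linear(traj_dim, z_dim)    # trajectory -> latent context z
scorer = nn.Linear(z_dim, 1)            # preference score on z (no per-step reward model)
policy = nn.Linear(state_dim + z_dim, action_dim)

# Preference loss on a pair of trajectories (A preferred to B), learned in one step.
trajA, trajB = torch.randn(traj_dim), torch.randn(traj_dim)
zA, zB = encoder(trajA), encoder(trajB)
pref_loss = nn.functional.binary_cross_entropy_with_logits(
    scorer(zA) - scorer(zB), torch.ones(1)
)
pref_loss.backward()

# Acting: condition the policy on a context judged good by the scorer.
with torch.no_grad():
    z_star = zA if scorer(zA) > scorer(zB) else zB
    action = policy(torch.cat([torch.randn(state_dim), z_star]))
```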
arXiv Detail & Related papers (2023-05-25T16:24:11Z) - Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid
Reinforcement Learning [66.43003402281659]
A central question is how to efficiently use online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that outperforms both pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z) - Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning.
Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
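Below is a small sketch of pool-based active preference querying in general (the entry's specific acquisition rule and benchmarks are not reproduced): a pool of segment pairs is formed from the offline dataset and the pair with the highest ensemble information gain is sent to the annotator. The linear reward ensemble and the BALD-style score are assumptions.
```python
import numpy as np

rng = np.random.default_rng(1)
n_segments, seg_len, feat_dim, n_models = 40, 10, 6, 7

segments = rng.normal(size=(n_segments, seg_len, feat_dim))    # offline segments (features)
reward_ensemble = rng.normal(size=(n_models, feat_dim))        # stand-in reward models
returns = np.einsum("md,nld->mn", reward_ensemble, segments)   # per-model segment returns

def binary_entropy(q):
    q = np.clip(q, 1e-6, 1 - 1e-6)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

def information_gain(i, j):
    """BALD-style score: entropy of the mean prediction minus mean entropy."""
    p = 1.0 / (1.0 + np.exp(-(returns[:, i] - returns[:, j])))  # per-model P(i preferred)
    return binary_entropy(p.mean()) - binary_entropy(p).mean()

pool = [(i, j) for i in range(n_segments) for j in range(i + 1, n_segments)]
query = max(pool, key=lambda ij: information_gain(*ij))
print("segment pair selected for preference labelling:", query)
```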
arXiv Detail & Related papers (2023-01-03T23:52:16Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
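Estimating the stationary distribution correction w(s, a) ≈ d*(s, a) / d_D(s, a) is the core of OptiDICE and is not reproduced here; the hedged sketch below only shows the downstream step such corrections are typically used for, extracting a policy by weighted behaviour cloning, with the weights replaced by a placeholder array.
```python
import torch
import torch.nn as nn

state_dim, n_actions, n = 6, 4, 256
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

states = torch.randn(n, state_dim)
actions = torch.randint(n_actions, (n,))
w = torch.rand(n)                      # placeholder for estimated corrections d*/d_D

logp = torch.log_softmax(policy(states), dim=-1)
chosen_logp = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
loss = -(w * chosen_logp).mean()       # correction-weighted behaviour cloning
loss.backward()
```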
arXiv Detail & Related papers (2021-06-21T00:43:30Z) - Behavioral Priors and Dynamics Models: Improving Performance and Domain
Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
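The entry above combines a learned dynamics model with a behavioural prior; the snippet below is a rough sketch of that general recipe, not MABE's implementation. The one-layer networks, the reward head and the coefficient alpha are assumptions, and in practice the model and prior would be pre-trained on offline data and held fixed.
```python
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 8, 2, 0.1
policy = nn.Linear(state_dim, action_dim)
behaviour_prior = nn.Linear(state_dim, action_dim)            # stand-in for a cloned prior
dynamics = nn.Linear(state_dim + action_dim, state_dim + 1)   # predicts next state + reward

states = torch.randn(16, state_dim)
actions = policy(states)
pred = dynamics(torch.cat([states, actions], dim=-1))
model_reward = pred[:, -1]                                    # last unit = predicted reward

# Bonus for staying close to the behavioural prior (a Gaussian log-likelihood up to constants).
prior_bonus = -((actions - behaviour_prior(states).detach()) ** 2).sum(dim=-1)

loss = -(model_reward + alpha * prior_bonus).mean()
loss.backward()   # in practice only the policy parameters would be updated from this loss
```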