Optimal Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2406.10445v2
- Date: Sat, 6 Jul 2024 20:03:16 GMT
- Title: Optimal Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning
- Authors: Yinglun Xu, David Zhu, Rohan Gumaste, Gagandeep Singh
- Abstract summary: We propose a framework to transfer the rich understanding of offline RL from the reward-based to the preference-based setting.
Our key insight is to transform preference feedback into scalar rewards via optimal reward labeling (ORL).
We empirically test our framework on preference datasets based on the standard D4RL benchmark.
- Score: 5.480108613013526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning has become one of the most practical RL settings. A recent success story has been RLHF, offline preference-based RL (PBRL) with preference from humans. However, most existing works on offline RL focus on the standard setting with scalar reward feedback. It remains unknown how to universally transfer the existing rich understanding of offline RL from the reward-based to the preference-based setting. In this work, we propose a general framework to bridge this gap. Our key insight is transforming preference feedback to scalar rewards via optimal reward labeling (ORL), and then any reward-based offline RL algorithms can be applied to the dataset with the reward labels. We theoretically show the connection between several recent PBRL techniques and our framework combined with specific offline RL algorithms in terms of how they utilize the preference signals. By combining reward labeling with different algorithms, our framework can lead to new and potentially more efficient offline PBRL algorithms. We empirically test our framework on preference datasets based on the standard D4RL benchmark. When combined with a variety of efficient reward-based offline RL algorithms, the learning result achieved under our framework is comparable to training the same algorithm on the dataset with actual rewards in many cases and better than the recent PBRL baselines in most cases.
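To make the pipeline concrete, below is a minimal sketch of the reward-labeling step under one natural objective: treat every transition's reward label as a free variable and fit the labels to the preference data with a Bradley-Terry likelihood plus an L2 regularizer. The paper's exact ORL objective may differ, and all function and variable names here are illustrative assumptions, not taken from the authors' code.

```python
# Hedged sketch of the "optimal reward labeling" idea: fit one scalar reward label
# per transition directly to the preference data, then hand the relabeled dataset
# to any reward-based offline RL algorithm. Objective and names are assumptions.
import torch

def label_rewards(segment_pairs, prefs, n_transitions, n_steps=2000, lr=0.1, reg=1e-3):
    """segment_pairs: list of (idx_a, idx_b), each a LongTensor of transition indices
    forming the two compared segments. prefs[i] = 1.0 if segment a was preferred, else 0.0."""
    r = torch.zeros(n_transitions, requires_grad=True)      # one free reward label per transition
    opt = torch.optim.Adam([r], lr=lr)
    prefs = torch.as_tensor(prefs, dtype=torch.float32)
    for _ in range(n_steps):
        opt.zero_grad()
        ret_a = torch.stack([r[ia].sum() for ia, _ in segment_pairs])   # segment returns
        ret_b = torch.stack([r[ib].sum() for _, ib in segment_pairs])
        # Bradley-Terry likelihood: P(a preferred over b) = sigmoid(return_a - return_b)
        nll = torch.nn.functional.binary_cross_entropy_with_logits(ret_a - ret_b, prefs)
        loss = nll + reg * (r ** 2).mean()                   # regularizer keeps labels bounded
        loss.backward()
        opt.step()
    return r.detach()

# dataset["rewards"] = label_rewards(pairs, prefs, len(dataset["observations"])).numpy()
# policy = any_offline_rl_algorithm(dataset)                 # e.g. IQL, CQL, TD3+BC
```

Once the dataset carries scalar reward labels, any reward-based offline RL algorithm can consume it unchanged, which is the point of the framework.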
Related papers
- Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z)
- More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z)
- Is RLHF More Difficult than Standard RL? [31.972393805014903]
Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals.
This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs.
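A canonical example of such a preference model is the Bradley-Terry model over trajectory returns; the equation below is the standard form of that model, shown for reference rather than quoted from the paper above.

```latex
% Bradley-Terry preference model over two trajectory segments \tau^1, \tau^2
% with an underlying reward function r (standard form, for reference only):
P\left(\tau^{1} \succ \tau^{2}\right)
  = \sigma\!\Big(\sum_{t} r(s_{t}^{1}, a_{t}^{1}) - \sum_{t} r(s_{t}^{2}, a_{t}^{2})\Big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
```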
arXiv Detail & Related papers (2023-06-25T03:18:15Z)
- Improving Offline RL by Blending Heuristics [33.810026421228635]
Heuristic Blending (HubL) improves the performance of offline RL algorithms based on value bootstrapping.
HubL consistently improves the policy quality of four state-of-the-art bootstrapping-based offline RL algorithms.
arXiv Detail & Related papers (2023-06-01T03:36:06Z)
- Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning [66.43003402281659]
A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z)
- Optimal Transport for Offline Imitation Learning [31.218468923400373]
Offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without the need to interact with the real environment.
We introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns rewards to offline trajectories.
We show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.
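For illustration, here is a rough, self-contained sketch of optimal-transport reward labeling in the spirit of OTR: compute an entropic optimal-transport plan between the states of an unlabeled trajectory and an expert demonstration, and use each agent state's (negative) transport cost as its reward. This shows the general idea only; the cost function, scaling, and reward squashing differ in the actual OTR algorithm, and all names are hypothetical.

```python
# Rough sketch of optimal-transport reward labeling (in the spirit of OTR, not the
# exact algorithm): Sinkhorn iterations in plain NumPy, assuming states are
# already normalized so the entropic regularizer reg is on a sensible scale.
import numpy as np

def ot_reward_labels(agent_states, expert_states, reg=0.05, n_iters=200):
    # squared Euclidean cost between every agent state and every expert state
    C = ((agent_states[:, None, :] - expert_states[None, :, :]) ** 2).sum(-1)
    a = np.full(len(agent_states), 1.0 / len(agent_states))    # uniform marginals
    b = np.full(len(expert_states), 1.0 / len(expert_states))
    K = np.exp(-C / reg)                                        # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                                    # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                             # entropic transport plan
    # reward of each agent state = negative transport cost it incurs (rescaled)
    return -(P * C).sum(axis=1) * len(agent_states)

# Every unlabeled trajectory in the offline dataset would be relabeled this way
# and then passed to a standard reward-based offline RL algorithm.
```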
arXiv Detail & Related papers (2023-03-24T12:45:42Z)
- Direct Preference-based Policy Optimization without Reward Modeling [25.230992130108767]
Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to learn from preferences.
We propose a PbRL algorithm that directly learns from preference without requiring any reward modeling.
We show that our algorithm surpasses offline RL methods that learn with ground-truth reward information.
arXiv Detail & Related papers (2023-01-30T12:51:13Z)
- Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning.
Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
arXiv Detail & Related papers (2023-01-03T23:52:16Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- Preference-based Reinforcement Learning with Finite-Time Guarantees [76.88632321436472]
Preference-based Reinforcement Learning (PbRL) replaces the reward values of traditional reinforcement learning with preference feedback to better elicit human opinion on the target objective.
Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy.
We present the first finite-time analysis for general PbRL problems.
arXiv Detail & Related papers (2020-06-16T03:52:41Z)