Beyond Reward: Offline Preference-guided Policy Optimization
- URL: http://arxiv.org/abs/2305.16217v2
- Date: Fri, 9 Jun 2023 07:48:49 GMT
- Title: Beyond Reward: Offline Preference-guided Policy Optimization
- Authors: Yachen Kang, Diyuan Shi, Jinxin Liu, Li He, Donglin Wang
- Abstract summary: Offline preference-based reinforcement learning (PbRL) is a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions.
This study focuses on offline preference-guided policy optimization (OPPO).
OPPO models offline trajectories and preferences in a one-step process, eliminating the need to learn a reward function separately.
- Score: 18.49648170835782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study focuses on the topic of offline preference-based reinforcement
learning (PbRL), a variant of conventional reinforcement learning that
dispenses with the need for online interaction or specification of reward
functions. Instead, the agent is provided with fixed offline trajectories and
human preferences between pairs of trajectories to extract the dynamics and
task information, respectively. Since the dynamics and task information are
orthogonal, a naive approach would involve using preference-based reward
learning followed by an off-the-shelf offline RL algorithm. However, this
requires the separate learning of a scalar reward function, which is assumed to
be an information bottleneck of the learning process. To address this issue, we
propose the offline preference-guided policy optimization (OPPO) paradigm,
which models offline trajectories and preferences in a one-step process,
eliminating the need for separately learning a reward function. OPPO achieves
this by introducing an offline hindsight information matching objective for
optimizing a contextual policy and a preference modeling objective for finding
the optimal context. OPPO further integrates a well-performing decision policy
by optimizing the two objectives iteratively. Our empirical results demonstrate
that OPPO effectively models offline preferences and outperforms prior
competing baselines, including offline RL algorithms performed over either true
or pseudo reward function specifications. Our code is available on the project
website: https://sites.google.com/view/oppo-icml-2023 .
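Below is a minimal, illustrative sketch of the two objectives described in the abstract: a hindsight information matching loss for a context-conditioned policy, and a Bradley-Terry-style preference loss over a learnable "optimal context". All module names, shapes (trajectories are flattened to single vectors here), and exact loss forms are assumptions for illustration, not the authors' released implementation (see the project website for that).
```python
# Illustrative sketch of OPPO-style objectives; names, shapes, and loss forms
# are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajEncoder(nn.Module):
    """Maps a flattened trajectory to a context vector z (assumed architecture)."""
    def __init__(self, traj_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
    def forward(self, traj):
        return self.net(traj)

class ContextPolicy(nn.Module):
    """Predicts an action from a state and a context z (assumed architecture)."""
    def __init__(self, state_dim, z_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

def hindsight_matching_loss(policy, encoder, traj, states, actions):
    # Condition the policy on the context encoded from the trajectory itself
    # and ask it to reproduce that trajectory's own actions.
    z = encoder(traj)                       # (B, z_dim)
    return F.mse_loss(policy(states, z), actions)

def preference_loss(encoder, z_star, traj_w, traj_l):
    # Score a trajectory by the negative distance between its embedding and a
    # learnable "optimal context" z_star; a Bradley-Terry-style loss pushes the
    # preferred trajectory's score above the rejected one's.
    s_w = -torch.norm(encoder(traj_w) - z_star, dim=-1)
    s_l = -torch.norm(encoder(traj_l) - z_star, dim=-1)
    return -F.logsigmoid(s_w - s_l).mean()
```
In this sketch, training would alternate (or sum) the two losses, matching the iterative optimization described in the abstract, and at evaluation time the policy would be conditioned on the learned z_star.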
Related papers
- Preference Elicitation for Offline Reinforcement Learning [59.136381500967744]
We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm.
Our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy.
arXiv Detail & Related papers (2024-06-26T15:59:13Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF: value-incentivized preference optimization (VPO).
VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function.
Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
arXiv Detail & Related papers (2024-05-29T17:51:42Z)
- Learning Goal-Conditioned Policies from Sub-Optimal Offline Data via Metric Learning [22.174803826742963]
We address the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning.
We propose the use of metric learning to approximate the optimal value function for goal-conditioned offline RL problems.
We show that our method estimates optimal behaviors from severely sub-optimal offline datasets without suffering from out-of-distribution estimation errors.
arXiv Detail & Related papers (2024-02-16T16:46:53Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the alignment problem, leads in practice to a compromised mean-seeking approximation of that optimal solution.
We propose efficient exact optimization (EXO) of the alignment objective.
(A minimal sketch of the DPO preference loss appears after this list.)
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- From Function to Distribution Modeling: A PAC-Generative Approach to Offline Optimization [30.689032197123755]
This paper considers the problem of offline optimization, where the objective function is unknown except for a collection of "offline" data examples.
Instead of learning and then optimizing the unknown objective function, we take on a less intuitive but more direct view that optimization can be thought of as a process of sampling from a generative model.
arXiv Detail & Related papers (2024-01-04T01:32:50Z)
- Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning [9.341618348621662]
We aim to find the best-performing policy within a limited budget of online interactions.
We first study the major online RL exploration methods based on intrinsic rewards and UCB.
We then introduce an algorithm for planning to go out-of-distribution that avoids these issues.
arXiv Detail & Related papers (2023-10-09T13:47:05Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- Representation Matters: Offline Pretraining for Sequential Decision Making [27.74988221252854]
In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making.
We find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms.
arXiv Detail & Related papers (2021-02-11T02:38:12Z)
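Several of the related entries above build on pairwise preference losses; as referenced in the EXO entry, here is a minimal sketch of the standard DPO objective computed on precomputed sequence-level log-probabilities. Function and tensor names are illustrative placeholders, not taken from any of the cited codebases.
```python
# Minimal sketch of the standard DPO loss over sequence-level log-probabilities.
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """policy_logp_*: log-probability of the chosen (w) / rejected (l) response
    under the policy being trained; ref_logp_*: the same under a frozen
    reference policy. beta scales the implicit KL-regularization strength."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```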