Provable Offline Preference-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2305.14816v2
- Date: Fri, 29 Sep 2023 19:18:55 GMT
- Title: Provable Offline Preference-Based Reinforcement Learning
- Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
- Abstract summary: We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
- Score: 95.00042541409901
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we investigate the problem of offline Preference-based
Reinforcement Learning (PbRL) with human feedback where feedback is available
in the form of preference between trajectory pairs rather than explicit
rewards. Our proposed algorithm consists of two main steps: (1) estimate the
implicit reward using Maximum Likelihood Estimation (MLE) with general function
approximation from offline data and (2) solve a distributionally robust
planning problem over a confidence set around the MLE. We consider the general
reward setting where the reward can be defined over the whole trajectory and
provide a novel guarantee that allows us to learn any target policy with a
polynomial number of samples, as long as the target policy is covered by the
offline data. This guarantee is the first of its kind with general function
approximation. To measure the coverage of the target policy, we introduce a new
single-policy concentrability coefficient, which can be upper bounded by the
per-trajectory concentrability coefficient. We also establish lower bounds that
highlight the necessity of such concentrability and the difference from
standard RL, where state-action-wise rewards are directly observed. We further
extend and analyze our algorithm when the feedback is given over action pairs.
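The two-step recipe in the abstract can be made concrete with a small sketch. The Python code below assumes a Bradley-Terry preference model over trajectory-level linear rewards and a finite candidate reward class; the function names, the gradient-descent MLE, and the likelihood-level-set confidence radius are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the two-step offline PbRL recipe from the abstract, under an assumed
# Bradley-Terry preference model with trajectory-level linear rewards and a small
# finite reward class. All names are illustrative, not from the paper's code.
import numpy as np

def traj_reward(theta, traj_features):
    # Assumed trajectory-level linear reward: r_theta(tau) = <theta, phi(tau)>.
    return traj_features @ theta

def neg_log_likelihood(theta, prefs):
    # Bradley-Terry negative log-likelihood over pairs (phi(tau_win), phi(tau_lose)).
    nll = 0.0
    for phi_win, phi_lose in prefs:
        margin = traj_reward(theta, phi_win) - traj_reward(theta, phi_lose)
        nll += np.log1p(np.exp(-margin))  # -log sigmoid(margin)
    return nll

def fit_reward_mle(prefs, dim, lr=0.1, steps=500):
    # Step 1: estimate the implicit reward by MLE (gradient descent on the BT loss).
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for phi_win, phi_lose in prefs:
            margin = traj_reward(theta, phi_win) - traj_reward(theta, phi_lose)
            grad -= (phi_win - phi_lose) / (1.0 + np.exp(margin))
        theta -= lr * grad / len(prefs)
    return theta

def pessimistic_policy(policies, reward_class, prefs, theta_mle, radius):
    # Step 2: distributionally robust planning over a likelihood-level-set confidence
    # set around the MLE; return the policy with the best worst-case value.
    # `policies` maps each candidate policy to its expected trajectory features E_pi[phi(tau)].
    nll_star = neg_log_likelihood(theta_mle, prefs)
    conf_set = [theta_mle] + [th for th in reward_class
                              if neg_log_likelihood(th, prefs) <= nll_star + radius]
    best_pi, best_val = None, -np.inf
    for pi, exp_phi in policies.items():
        worst = min(traj_reward(th, exp_phi) for th in conf_set)
        if worst > best_val:
            best_pi, best_val = pi, worst
    return best_pi
```

The worst-case step over the confidence set is what makes the planner pessimistic: a policy is only selected if it performs well under every reward function that remains consistent with the preference data, which is why coverage of the target policy by the offline data is the key quantity in the guarantee.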
Related papers
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level)
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower-level optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Offline Reinforcement Learning with Additional Covering Distributions [0.0]
We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation.
We show that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset and weak realizable function classes.
arXiv Detail & Related papers (2023-05-22T03:31:03Z) - Goal-conditioned Offline Reinforcement Learning through State Space Partitioning [9.38848713730931]
offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets.
We argue that, despite its benefits, this approach is still insufficient to fully address the distribution shift and multi-modality problems.
We propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias.
arXiv Detail & Related papers (2023-03-16T14:52:53Z) - Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)