Reinforcement Learning with Human Feedback: Learning Dynamic Choices via
Pessimism
- URL: http://arxiv.org/abs/2305.18438v3
- Date: Mon, 3 Jul 2023 13:08:46 GMT
- Title: Reinforcement Learning with Human Feedback: Learning Dynamic Choices via
Pessimism
- Authors: Zihao Li, Zhuoran Yang, Mengdi Wang
- Abstract summary: We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
- Score: 91.52263068880484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study offline Reinforcement Learning with Human Feedback
(RLHF) where we aim to learn the human's underlying reward and the MDP's
optimal policy from a set of trajectories induced by human choices. RLHF is
challenging for multiple reasons: large state space but limited human feedback,
the bounded rationality of human decisions, and the off-policy distribution
shift. In this paper, we focus on the Dynamic Discrete Choice (DDC) model for
modeling and understanding human choices. The DDC model, rooted in econometrics and
decision theory, is widely used to model human decision-making processes with
forward-looking behavior and bounded rationality. We propose a
Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method. The method
involves a three-stage process: the first step is
to estimate the human behavior policy and the state-action value function via
maximum likelihood estimation (MLE); the second step recovers the human reward
function via minimizing Bellman mean squared error using the learned value
functions; the third step is to plug in the learned reward and invoke
pessimistic value iteration for finding a near-optimal policy. With only
single-policy coverage (i.e., coverage of the optimal policy) of the dataset, we
prove that the suboptimality of DCPPO almost matches that of the classical
pessimistic offline RL algorithm in terms of its dependence on distribution shift
and dimension. To the best of our knowledge, this paper presents the first
theoretical guarantees for off-policy offline RLHF with a dynamic discrete choice
model.
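To make the three-stage recipe concrete, below is a minimal toy sketch in Python. The random linear features, the softmax (Gumbel-shock) choice model for the synthetic human, the closed-form Bellman-residual reward fit, and the count-based pessimism bonus are all illustrative assumptions standing in for the paper's function classes and uncertainty quantifier, not its exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 6, 3, 4, 0.9

phi = rng.normal(size=(S, A, d))                     # feature map phi(s, a) (assumed)
theta_true = rng.normal(size=d)
Q_true = phi @ theta_true                            # true choice-specific values
P = rng.dirichlet(np.ones(S), size=(S, A))           # true transition kernel

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Offline data: a boundedly rational, forward-looking human picks
# a ~ softmax(Q_true(s, .)), as in a dynamic discrete choice model.
pi_b = softmax(Q_true)
data, s = [], 0
for _ in range(8000):
    a = rng.choice(A, p=pi_b[s])
    s_next = rng.choice(S, p=P[s, a])
    data.append((s, a, s_next))
    s = s_next

counts = np.zeros((S, A))
trans = np.zeros((S, A, S))
for (s, a, s_next) in data:
    counts[s, a] += 1
    trans[s, a, s_next] += 1

# Stage 1: MLE of the behavior policy / choice-specific value function.
n_s = counts.sum(axis=1)
emp_feat = (counts[..., None] * phi).sum(axis=(0, 1))
theta = np.zeros(d)
for _ in range(1000):                                # gradient ascent on the log-likelihood
    probs = softmax(phi @ theta)
    model_feat = ((n_s[:, None] * probs)[:, :, None] * phi).sum(axis=(0, 1))
    theta += 0.05 * (emp_feat - model_feat) / len(data)
Q_hat = phi @ theta                                  # estimated choice-specific value

# Stage 2: recover the reward by minimizing the Bellman mean-squared error.
# DDC Bellman equation (up to the Gumbel constant):
#   Q(s, a) = r(s, a) + gamma * E_{s'|s,a}[ logsumexp_a' Q(s', a') ]
V_hat = np.log(np.exp(Q_hat).sum(axis=1))
P_hat = (trans + 1e-3) / (trans + 1e-3).sum(axis=2, keepdims=True)
r_hat = Q_hat - gamma * (P_hat @ V_hat)              # least-squares Bellman residual

# Stage 3: pessimistic value iteration with the learned reward,
# penalizing rarely visited state-action pairs.
bonus = 1.0 / np.sqrt(np.maximum(counts, 1.0))
Q_pess = np.zeros((S, A))
for _ in range(300):
    V = Q_pess.max(axis=1)
    Q_pess = r_hat - bonus + gamma * (P_hat @ V)
print("DCPPO greedy policy:", Q_pess.argmax(axis=1))
```

The count-based bonus is only a crude stand-in for the data-dependent uncertainty quantifier used in pessimistic offline RL analyses; the point of the sketch is the ordering of the three stages, not the exact penalty.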
Related papers
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human
Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
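As a rough illustration of the entry above, the sketch below estimates a value gap between two perturbed policies from simulated pairwise preferences (via an assumed Bradley-Terry link) and feeds it to a two-point zeroth-order gradient step. The softmax policy class, the preference simulator, and all constants are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, delta, lr = 4, 2, 8, 0.3, 0.5
phi = rng.normal(size=(S, A, d))                    # policy features (assumed)
P = rng.dirichlet(np.ones(S), size=(S, A))          # toy MDP
r = rng.uniform(0, 1, size=(S, A))

def rollout(theta, T=20):
    """Return of one trajectory under the softmax policy pi_theta."""
    s, ret = 0, 0.0
    for _ in range(T):
        logits = phi[s] @ theta
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(A, p=p)
        ret += r[s, a]
        s = rng.choice(S, p=P[s, a])
    return ret

def prefers_first(ret1, ret2):
    """Simulated human feedback: Bradley-Terry preference over two returns."""
    return rng.random() < 1.0 / (1.0 + np.exp(-(ret1 - ret2)))

theta = np.zeros(d)
for _ in range(200):
    u = rng.normal(size=d); u /= np.linalg.norm(u)  # random perturbation direction
    # The empirical win rate of pi_{theta + delta*u} over pi_{theta - delta*u},
    # pushed through the inverse Bradley-Terry link, estimates their value gap.
    wins = sum(prefers_first(rollout(theta + delta * u), rollout(theta - delta * u))
               for _ in range(20))
    p_hat = np.clip(wins / 20, 0.05, 0.95)
    value_gap = np.log(p_hat / (1.0 - p_hat))
    theta += lr * (value_gap / (2.0 * delta)) * u   # two-point zeroth-order step
print("average return after training:", np.mean([rollout(theta) for _ in range(50)]))
```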
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment.
We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO).
XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF).
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
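As a toy illustration of the "simple contrastive objective" mentioned in the entry above, the sketch below fits a tabular policy directly from segment preferences by contrasting temperature-scaled segment log-likelihoods. The exact loss form, the temperature alpha, and the synthetic preference data are assumptions for illustration, not the paper's verbatim objective.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, alpha, step = 4, 3, 0.1, 1.0
params = np.zeros((S, A))                          # tabular policy logits

def seg_logprob(segment, params):
    """alpha-scaled sum of log pi(a|s) over a segment, plus its gradient."""
    total, grad = 0.0, np.zeros_like(params)
    for (s, a) in segment:
        p = np.exp(params[s] - params[s].max()); p /= p.sum()
        total += alpha * np.log(p[a])
        grad[s, a] += alpha
        grad[s] -= alpha * p                       # d log pi / d logits
    return total, grad

# Synthetic preference pairs: "preferred" segments take action 0, "rejected"
# segments take some other action (a stand-in for human segment comparisons).
preferred = [[(int(s), 0) for s in rng.integers(0, S, size=5)] for _ in range(200)]
rejected = [[(int(s), int(rng.integers(1, A))) for s in rng.integers(0, S, size=5)]
            for _ in range(200)]

for _ in range(100):
    g = np.zeros_like(params)
    for seg_p, seg_r in zip(preferred, rejected):
        lp, gp = seg_logprob(seg_p, params)
        lq, gq = seg_logprob(seg_r, params)
        w = 1.0 / (1.0 + np.exp(lp - lq))          # 1 - sigmoid(lp - lq)
        g += w * (gp - gq)                         # gradient of -log sigmoid(lp - lq)
    params += step * g / len(preferred)
print("greedy action per state:", params.argmax(axis=1))
```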
- The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models [5.736353542430439]
We introduce the Boltzmann policy distribution (BPD), which serves as a prior over human policies.
BPD adapts via Bayesian inference to capture systematic deviations by observing human actions during a single episode.
We show that the BPD enables prediction of human behavior and human-AI collaboration as well as imitation-learning-based human models do.
arXiv Detail & Related papers (2022-04-22T15:26:25Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
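Since pessimistic value iteration is the ingredient the main paper plugs its learned reward into, here is a compact sketch of value iteration with an uncertainty quantifier used as a penalty, in a toy linear setting. The feature map, the ridge regularization, and the bonus constant beta are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, d, H, beta = 5, 2, 3, 10, 1.0
phi = rng.normal(size=(S, A, d))                     # feature map phi(s, a) (assumed)

# Offline dataset of (s, a, r, s') tuples per step, from a uniform logging policy.
P = rng.dirichlet(np.ones(S), size=(S, A))
r_true = rng.uniform(0, 1, size=(S, A))
data = [[] for _ in range(H)]
for _ in range(500):
    s = rng.integers(S)
    for h in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        data[h].append((s, a, r_true[s, a], s_next))
        s = s_next

# Backward induction with penalty Gamma(s, a) = beta * ||phi(s, a)||_{Lambda^{-1}}.
V = np.zeros(S)                                      # value at step H+1 is zero
for h in reversed(range(H)):
    Lam = 1e-2 * np.eye(d)                           # regularized Gram matrix
    target = np.zeros(d)
    for (s, a, rew, s_next) in data[h]:
        Lam += np.outer(phi[s, a], phi[s, a])
        target += phi[s, a] * (rew + V[s_next])
    w = np.linalg.solve(Lam, target)                 # ridge regression for Q_h
    Lam_inv = np.linalg.inv(Lam)
    Gamma = beta * np.sqrt(np.einsum('sad,de,sae->sa', phi, Lam_inv, phi))
    Q = np.clip(phi @ w - Gamma, 0.0, H - h)         # pessimistic, truncated Q_h
    V = Q.max(axis=1)                                # greedy value for step h
print("pessimistic values at the initial step:", V)
```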
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.