A General Theoretical Paradigm to Understand Learning from Human
Preferences
- URL: http://arxiv.org/abs/2310.12036v2
- Date: Wed, 22 Nov 2023 00:02:49 GMT
- Title: A General Theoretical Paradigm to Understand Learning from Human
Preferences
- Authors: Mohammad Gheshlaghi Azar and Mark Rowland and Bilal Piot and Daniel
Guo and Daniele Calandriello and Michal Valko and R\'emi Munos
- Abstract summary: We derive a new general objective called $Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences.
This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO.
- Score: 33.65903139056413
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The prevalent deployment of learning from human preferences through
reinforcement learning (RLHF) relies on two important approximations: the first
assumes that pairwise preferences can be substituted with pointwise rewards.
The second assumes that a reward model trained on these pointwise rewards can
generalize from collected data to out-of-distribution data sampled by the
policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an
approach that bypasses the second approximation and learn directly a policy
from collected data without the reward modelling stage. However, this method
still heavily relies on the first approximation.
In this paper we try to gain a deeper theoretical understanding of these
practical algorithms. In particular we derive a new general objective called
$\Psi$PO for learning from human preferences that is expressed in terms of
pairwise preferences and therefore bypasses both approximations. This new
general objective allows us to perform an in-depth analysis of the behavior of
RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential
pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$
simply to Identity, for which we can derive an efficient optimisation
procedure, prove performance guarantees and demonstrate its empirical
superiority to DPO on some illustrative examples.
Related papers
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches.
We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z) - Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both [6.102274021710727]
Direct Reward Distillation and policy-Optimization (DRDO) is a supervised knowledge distillation-based preference alignment method.
DRDO directly mimics rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation.
Our experimental results on the Ultrafeedback and TL;DR datasets demonstrate that policies trained using DRDO surpass previous methods.
arXiv Detail & Related papers (2024-10-11T02:19:11Z) - Zeroth-Order Policy Gradient for Reinforcement Learning from Human
Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - $i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLM) can sometimes produce outputs that deviate from human expectations.
We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization.
We show that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators.
arXiv Detail & Related papers (2024-05-24T05:42:11Z) - Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences [24.645259298082436]
We take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO)
We derive minimax statistical bounds on the suboptimality gap induced by both RLHF and DPO.
We extend our analysis to the approximate optimization setting and derive exponentially decaying convergence rates for both RLHF and DPO.
arXiv Detail & Related papers (2024-03-04T09:13:14Z) - Sample Complexity of Preference-Based Nonparametric Off-Policy
Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.