Principled Reinforcement Learning with Human Feedback from Pairwise or
$K$-wise Comparisons
- URL: http://arxiv.org/abs/2301.11270v5
- Date: Thu, 8 Feb 2024 04:16:52 GMT
- Title: Principled Reinforcement Learning with Human Feedback from Pairwise or
$K$-wise Comparisons
- Authors: Banghua Zhu, Jiantao Jiao, Michael I. Jordan
- Abstract summary: We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions.
- Score: 79.98542868281473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide a theoretical framework for Reinforcement Learning with Human
Feedback (RLHF). Our analysis shows that when the true reward function is
linear, the widely used maximum likelihood estimator (MLE) converges under both
the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However,
we show that when training a policy based on the learned reward model, MLE
fails while a pessimistic MLE provides policies with improved performance under
certain coverage assumptions. Additionally, we demonstrate that under the PL
model, the true MLE and an alternative MLE that splits the $K$-wise comparison
into pairwise comparisons both converge. Moreover, the true MLE is
asymptotically more efficient. Our results validate the empirical success of
existing RLHF algorithms in InstructGPT and provide new insights for algorithm
design. Furthermore, our results unify the problem of RLHF and max-entropy
Inverse Reinforcement Learning (IRL), and provide the first sample complexity
bound for max-entropy IRL.
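As a rough illustration of the estimator the abstract refers to, the sketch below fits a linear reward by maximum likelihood under the BTL model, where the probability that item 0 is preferred over item 1 is sigmoid(theta . (phi0 - phi1)). It is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian features, the true parameter `theta_true`, the sample size, the step size, and the helper names (`sample_comparisons`, `btl_mle`) are all hypothetical choices made for the example. It covers only the plain MLE, not the pessimistic MLE or the $K$-wise PL case.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_comparisons(theta_true, n, dim, rng):
    """Draw n synthetic pairwise comparisons; y = 1 means item 0 was preferred."""
    data = []
    for _ in range(n):
        phi0 = [rng.gauss(0, 1) for _ in range(dim)]  # assumed feature model
        phi1 = [rng.gauss(0, 1) for _ in range(dim)]
        diff = [a - b for a, b in zip(phi0, phi1)]
        # BTL with linear reward: P(item 0 preferred) = sigmoid(theta . diff)
        p0 = sigmoid(dot(theta_true, diff))
        y = 1 if rng.random() < p0 else 0
        data.append((diff, y))
    return data


def btl_mle(data, dim, steps=300, lr=0.5):
    """Gradient ascent on the average BTL log-likelihood (concave in theta)."""
    theta = [0.0] * dim
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * dim
        for diff, y in data:
            resid = y - sigmoid(dot(theta, diff))
            for j in range(dim):
                grad[j] += resid * diff[j] / n
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


rng = random.Random(0)
theta_true = [1.0, -0.5, 0.25]  # hypothetical ground-truth reward parameter
data = sample_comparisons(theta_true, n=2000, dim=3, rng=rng)
theta_hat = btl_mle(data, dim=3)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_hat, theta_true)))
print("estimate:", [round(t, 3) for t in theta_hat])
print("L2 error:", round(err, 3))
```

With enough comparisons and well-spread features, the estimate should land close to `theta_true`, which is the kind of convergence the abstract describes for the linear-reward setting. The abstract's separate point is that a policy trained greedily on such an estimate can still fail, which is what motivates the pessimistic variant.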
Related papers
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human
Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference.
The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
- Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z)
- Generalizing Reward Modeling for Out-of-Distribution Preference Learning [3.9160947065896803]
Preference learning with large language models (LLMs) aims to align the LLMs' generations with human preferences.
Because human feedback is difficult to obtain, training a separate reward model for every encountered distribution is challenging.
This work addresses out-of-distribution (OOD) preference learning (PL) by optimizing a general reward model through a meta-learning approach.
arXiv Detail & Related papers (2024-02-22T18:20:33Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Inverse Reinforcement Learning for Text Summarization [52.765898203824975]
We introduce inverse reinforcement learning (IRL) as an effective paradigm for training abstractive summarization models.
Experimental results across datasets in different domains demonstrate the superiority of our proposed IRL model for summarization over MLE and RL baselines.
arXiv Detail & Related papers (2022-12-19T23:45:05Z)
- Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo
sampling [58.14878401145309]
We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model.
We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.
arXiv Detail & Related papers (2022-05-12T11:15:47Z)
- Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
- Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.