Is RLHF More Difficult than Standard RL?
- URL: http://arxiv.org/abs/2306.14111v2
- Date: Fri, 3 Nov 2023 18:36:03 GMT
- Title: Is RLHF More Difficult than Standard RL?
- Authors: Yuanhao Wang, Qinghua Liu, Chi Jin
- Abstract summary: Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals.
This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs.
- Score: 31.972393805014903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward
signals. Preferences arguably contain less information than rewards, which
makes preference-based RL seemingly more difficult. This paper theoretically
proves that, for a wide range of preference models, we can solve
preference-based RL directly using existing algorithms and techniques for
reward-based RL, with small or no extra costs. Specifically, (1) for
preferences that are drawn from reward-based probabilistic models, we reduce
the problem to robust reward-based RL that can tolerate small errors in
rewards; (2) for general arbitrary preferences where the objective is to find
the von Neumann winner, we reduce the problem to multiagent reward-based RL
which finds Nash equilibria for factored Markov games with a restricted set of
policies. The latter case can be further reduced to adversarial MDP when
preferences only depend on the final state. We instantiate all reward-based RL
subroutines by concrete provable algorithms, and apply our theory to a large
class of models including tabular MDPs and MDPs with generic function
approximation. We further provide guarantees when K-wise comparisons are
available.
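The two reductions in the abstract can be illustrated with small self-contained sketches. For reduction (1), assume preferences follow a Bradley-Terry model, in which option i is preferred over option j with probability sigmoid(r_i - r_j). The snippet below is a minimal sketch under that assumption (the helper names, query budget, and clipping constants are invented for illustration and are not the paper's algorithm); it inverts empirical win rates into noisy reward-difference estimates, which is the kind of input that the robust reward-based RL subroutine in reduction (1) is asked to tolerate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth rewards for a toy set of 4 actions (hidden from the learner).
true_rewards = np.array([0.1, 0.4, 0.7, 0.2])


def sample_preference(i, j):
    """Bradley-Terry model: P(i preferred over j) = sigmoid(r_i - r_j)."""
    p = 1.0 / (1.0 + np.exp(-(true_rewards[i] - true_rewards[j])))
    return rng.random() < p


def estimate_reward_gaps(n_queries=5000):
    """Invert empirical preference frequencies into reward-difference estimates.

    Since P(i > j) = sigmoid(r_i - r_j), the logit of the empirical win rate
    is a noisy estimate of r_i - r_j; such noisy rewards are exactly what a
    robust reward-based RL algorithm is required to tolerate.
    """
    n = len(true_rewards)
    wins = np.zeros((n, n))
    counts = np.zeros((n, n))
    for _ in range(n_queries):
        i, j = rng.choice(n, size=2, replace=False)
        winner, loser = (i, j) if sample_preference(i, j) else (j, i)
        wins[winner, loser] += 1
        counts[i, j] += 1
        counts[j, i] += 1
    freq = wins / np.maximum(counts, 1)
    np.fill_diagonal(freq, 0.5)               # an option "ties" with itself
    freq = np.clip(freq, 1e-3, 1 - 1e-3)      # avoid logit(0) and logit(1)
    return np.log(freq / (1 - freq))          # estimated r_i - r_j for each pair


print(np.round(estimate_reward_gaps(), 2))    # compare with pairwise gaps of true_rewards
```

For reduction (2), the target is a von Neumann winner: a mixed policy that is preferred to every competitor with probability at least 1/2, which exists even when preferences are cyclic and no single option beats all others. In the simplest finite-option case this is a maximin strategy of the symmetric zero-sum game with payoff P - 1/2. The sketch below approximates it with plain Hedge self-play as a stand-in for the paper's multiagent reward-based RL subroutine (the preference matrix, step size, and iteration count are illustrative only).

```python
import numpy as np

# P[i, j] = probability that option i is preferred over option j (P + P.T = 1).
# This toy matrix is cyclic, so no single option beats all others.
P = np.array([
    [0.5, 0.6, 0.3],
    [0.4, 0.5, 0.8],
    [0.7, 0.2, 0.5],
])
A = P - 0.5                      # antisymmetric payoff of the induced zero-sum game
n, eta, T = A.shape[0], 0.1, 20000

log_w = np.zeros(n)
avg_policy = np.zeros(n)
for _ in range(T):
    policy = np.exp(log_w - log_w.max())
    policy /= policy.sum()
    avg_policy += policy / T
    log_w += eta * (A @ policy)  # Hedge self-play update, kept in log-space for stability

print("approximate von Neumann winner:", np.round(avg_policy, 3))
print("worst-case win probability   :", round((avg_policy @ P).min(), 3))
```

On this toy matrix the averaged policy's worst-case win probability lands close to 1/2, which is the defining guarantee of a von Neumann winner.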
Related papers
- Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R3M$, which models potentially corrupted preference labels as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions (a toy weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- Direct Preference-based Policy Optimization without Reward Modeling [25.230992130108767]
Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to learn from preferences.
We propose a PbRL algorithm that directly learns from preferences without requiring any reward modeling.
We show that our algorithm surpasses offline RL methods that learn with ground-truth reward information.
arXiv Detail & Related papers (2023-01-30T12:51:13Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- A Simple Reward-free Approach to Constrained Reinforcement Learning [33.813302183231556]
This paper bridges reward-free RL and constrained RL.
In particular, we propose a simple meta-algorithm such that, given any reward-free RL oracle, the approachability and constrained RL problems can be directly solved with negligible overhead in sample complexity.
arXiv Detail & Related papers (2021-07-12T06:27:30Z)
- Preference-based Reinforcement Learning with Finite-Time Guarantees [76.88632321436472]
Preference-based Reinforcement Learning (PbRL) replaces the reward values of traditional reinforcement learning with preferences to better elicit human opinion on the target objective.
Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy.
We present the first finite-time analysis for general PbRL problems.
arXiv Detail & Related papers (2020-06-16T03:52:41Z)
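As a rough, toy illustration of the weight-averaging idea referenced in the WARM entry above (this is not that paper's training recipe; the linear reward models, synthetic preference data, and cosine-similarity check are all invented for the example), the sketch below trains several Bradley-Terry reward models on independently sampled comparisons and simply averages their weights:

```python
import numpy as np

d = 8                                                 # feature dimension of a toy linear reward model
true_w = np.random.default_rng(42).normal(size=d)     # "ideal" reward direction (unknown in practice)


def train_reward_model(seed, n=500, epochs=200, lr=0.5):
    """Fit a linear Bradley-Terry reward model on noisy pairwise preferences."""
    r = np.random.default_rng(seed)
    xa, xb = r.normal(size=(n, d)), r.normal(size=(n, d))   # features of the two compared items
    logits = (xa - xb) @ true_w
    y = (r.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)  # noisy preference labels
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(xa - xb) @ w))
        w += lr * (xa - xb).T @ (y - p) / n                  # gradient ascent on the log-likelihood
    return w


members = [train_reward_model(seed) for seed in range(5)]
warm_like = np.mean(members, axis=0)                         # average the weights across members

cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print("individual models:", [round(cosine(w, true_w), 3) for w in members])
print("weight-averaged  :", round(cosine(warm_like, true_w), 3))
```

For linear models, averaging weights coincides with averaging predictions; the point of the sketch is only the mechanic of merging several reward models into a single set of weights.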
This list is automatically generated from the titles and abstracts of the papers on this site.