Reinforcement Learning from Human Feedback with Active Queries
- URL: http://arxiv.org/abs/2402.09401v1
- Date: Wed, 14 Feb 2024 18:58:40 GMT
- Title: Reinforcement Learning from Human Feedback with Active Queries
- Authors: Kaixuan Ji and Jiafan He and Quanquan Gu
- Abstract summary: Current RLHF approaches often require a large amount of human-labelled preference data.
We propose query-efficient RLHF methods, inspired by the success of active learning.
Our experiments show that ADPO matches the performance of the state-of-the-art DPO method while making only about half as many human preference queries.
- Score: 67.27150911254155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models (LLMs) with human preference plays a key role
in building modern generative models and can be achieved by reinforcement
learning from human feedback (RLHF). Despite their superior performance,
current RLHF approaches often require a large amount of human-labelled
preference data, which is expensive to collect. In this paper, inspired by the
success of active learning, we address this problem by proposing
query-efficient RLHF methods. We first formalize the alignment problem as a
contextual dueling bandit problem and design an active-query-based proximal
policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ regret
bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the
dimension of feature space and $\Delta$ is the sub-optimality gap over all the
contexts. We then propose ADPO, a practical version of our algorithm based on
direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our
experiments show that ADPO, while making only about half as many human
preference queries, matches the performance of the state-of-the-art DPO method.
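The active-query idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration, not the authors' APPO/ADPO implementation: it computes a DPO-style implicit reward margin for a candidate response pair and requests a human preference label only when the current model is uncertain, i.e. when the margin falls below a threshold. All names and values (`implicit_reward_margin`, `should_query_human`, `margin_threshold`, the toy log-probabilities) are assumptions made for illustration.

```python
def implicit_reward_margin(policy_logp_a: float, policy_logp_b: float,
                           ref_logp_a: float, ref_logp_b: float,
                           beta: float = 0.1) -> float:
    """DPO-style implicit reward margin between responses a and b.

    r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)); the margin is
    r(x, a) - r(x, b). Inputs are sequence log-probabilities.
    """
    r_a = beta * (policy_logp_a - ref_logp_a)
    r_b = beta * (policy_logp_b - ref_logp_b)
    return r_a - r_b

def should_query_human(margin: float, margin_threshold: float = 0.5) -> bool:
    """Active-query rule (an assumption, not the paper's exact criterion):
    request a human preference label only when the model is uncertain,
    i.e. when |margin| is small; otherwise keep the model's own ranking
    as a pseudo-label and skip the expensive human query."""
    return abs(margin) < margin_threshold

# Example: a confident pair is skipped, an uncertain pair is queried.
confident = implicit_reward_margin(-10.0, -25.0, -14.0, -20.0)   # large |margin|
uncertain = implicit_reward_margin(-15.0, -15.2, -15.1, -15.0)   # small |margin|
print(should_query_human(confident))  # False -> reuse the model's ranking
print(should_query_human(uncertain))  # True  -> request a human label
```

Under such a rule, the number of human queries scales with how often the model is genuinely uncertain, which is consistent with the roughly 50% query reduction reported in the abstract.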
Related papers
- Dual Active Learning for Reinforcement Learning from Human Feedback [13.732678966515781]
Reinforcement learning from human feedback (RLHF) is widely applied to align large language models with human preferences.
Human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label.
In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem.
arXiv Detail & Related papers (2024-10-03T14:09:58Z)
- Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning [70.22819290458581]
Reinforcement learning with human feedback (RLHF) is a widely adopted approach in current large language model pipelines.
Our approach introduces two key innovations: (1) on-policy query to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries.
arXiv Detail & Related papers (2024-07-02T10:09:19Z)
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO).
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models with human preferences.
Current methods rely on uniformly sampling prompt-generation pairs from a dataset of prompt-generations.
We develop an active-learning algorithm, $\texttt{APO}$, which enhances model alignment by querying preference data.
arXiv Detail & Related papers (2024-02-16T08:19:34Z)
- Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
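For the trajectory-preference setting described in the last entry above, a common modeling choice is a Bradley-Terry model over trajectory returns; the paper's exact construction with general function approximation may differ. The sketch below, with hypothetical names and a toy hand-written reward, shows how a score gap between two trajectories is turned into a preference probability and a negative-log-likelihood loss for one labelled pair.

```python
import math

def trajectory_return(reward_fn, trajectory):
    """Sum a learned per-step reward over a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def preference_prob(reward_fn, traj_a, traj_b):
    """Bradley-Terry model: P(traj_a preferred over traj_b) is the logistic
    of the gap between the two trajectory returns."""
    gap = trajectory_return(reward_fn, traj_a) - trajectory_return(reward_fn, traj_b)
    return 1.0 / (1.0 + math.exp(-gap))

def pairwise_nll(reward_fn, traj_pref, traj_rej):
    """Negative log-likelihood of one human label saying traj_pref > traj_rej;
    minimizing this over labelled pairs fits the reward model."""
    return -math.log(preference_prob(reward_fn, traj_pref, traj_rej))

# Toy example: a hand-written reward that prefers the "good" action.
reward_fn = lambda s, a: 1.0 if a == "good" else -1.0
traj_pref = [(0, "good"), (1, "good")]
traj_rej = [(0, "bad"), (1, "good")]
print(round(pairwise_nll(reward_fn, traj_pref, traj_rej), 4))  # small loss: label agrees
```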
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.