Reinforcement Learning from Human Feedback with Active Queries
- URL: http://arxiv.org/abs/2402.09401v1
- Date: Wed, 14 Feb 2024 18:58:40 GMT
- Title: Reinforcement Learning from Human Feedback with Active Queries
- Authors: Kaixuan Ji and Jiafan He and Quanquan Gu
- Abstract summary: Current RLHF approaches often require a large amount of human-labelled preference data.
We propose query-efficient RLHF methods, inspired by the success of active learning.
Our experiments show that ADPO matches the performance of the state-of-the-art DPO method while making only about half as many human preference queries.
- Score: 67.27150911254155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models (LLMs) with human preference plays a key role
in building modern generative models and can be achieved by reinforcement
learning from human feedback (RLHF). Despite their superior performance,
current RLHF approaches often require a large amount of human-labelled
preference data, which is expensive to collect. In this paper, inspired by the
success of active learning, we address this problem by proposing
query-efficient RLHF methods. We first formalize the alignment problem as a
contextual dueling bandit problem and design an active-query-based proximal
policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ regret
bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the
dimension of feature space and $\Delta$ is the sub-optimality gap over all the
contexts. We then propose ADPO, a practical version of our algorithm based on
direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our
experiments show that ADPO, while making only about half as many human
preference queries, matches the performance of the state-of-the-art DPO method.
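The active-query idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration, not the authors' APPO/ADPO implementation: it computes a DPO-style implicit reward margin for a candidate response pair and requests a human preference label only when the current model is uncertain, i.e. when the margin falls below a threshold. All names and values (`implicit_reward_margin`, `should_query_human`, `margin_threshold`, the toy log-probabilities) are assumptions made for illustration.

```python
def implicit_reward_margin(policy_logp_a: float, policy_logp_b: float,
                           ref_logp_a: float, ref_logp_b: float,
                           beta: float = 0.1) -> float:
    """DPO-style implicit reward margin between responses a and b.

    r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)); the margin is
    r(x, a) - r(x, b). Inputs are sequence log-probabilities.
    """
    r_a = beta * (policy_logp_a - ref_logp_a)
    r_b = beta * (policy_logp_b - ref_logp_b)
    return r_a - r_b

def should_query_human(margin: float, margin_threshold: float = 0.5) -> bool:
    """Active-query rule (an assumption, not the paper's exact criterion):
    request a human preference label only when the model is uncertain,
    i.e. when |margin| is small; otherwise keep the model's own ranking
    as a pseudo-label and skip the expensive human query."""
    return abs(margin) < margin_threshold

# Example: a confident pair is skipped, an uncertain pair is queried.
confident = implicit_reward_margin(-10.0, -25.0, -14.0, -20.0)   # large |margin|
uncertain = implicit_reward_margin(-15.0, -15.2, -15.1, -15.0)   # small |margin|
print(should_query_human(confident))  # False -> reuse the model's ranking
print(should_query_human(uncertain))  # True  -> request a human label
```

Under such a rule, the number of human queries scales with how often the model is genuinely uncertain, which is consistent with the roughly 50% query reduction reported in the abstract.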
Related papers
- Dual Active Learning for Reinforcement Learning from Human Feedback [13.732678966515781]
Reinforcement learning from human feedback (RLHF) is widely applied to align large language models with human preferences.
Human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label.
In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem.
arXiv Detail & Related papers (2024-10-03T14:09:58Z)
- Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning [70.22819290458581]
Reinforcement learning with human feedback (RLHF) is a widely adopted approach in current large language model pipelines.
Our approach introduces two key innovations: (1) on-policy query to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries.
arXiv Detail & Related papers (2024-07-02T10:09:19Z)
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO).
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models with human preferences.
Current methods rely on uniformly sampling prompt-generation pairs from a dataset of prompt-generations.
We develop an active-learning algorithm, $\texttt{APO}$, which enhances model alignment by querying preference data.
arXiv Detail & Related papers (2024-02-16T08:19:34Z)
- Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
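For the trajectory-preference setting described in the last entry above, a common modeling choice is a Bradley-Terry model over trajectory returns; the paper's exact construction with general function approximation may differ. The sketch below, with hypothetical names and a toy hand-written reward, shows how a score gap between two trajectories is turned into a preference probability and a negative-log-likelihood loss for one labelled pair.

```python
import math

def trajectory_return(reward_fn, trajectory):
    """Sum a learned per-step reward over a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def preference_prob(reward_fn, traj_a, traj_b):
    """Bradley-Terry model: P(traj_a preferred over traj_b) is the logistic
    of the gap between the two trajectory returns."""
    gap = trajectory_return(reward_fn, traj_a) - trajectory_return(reward_fn, traj_b)
    return 1.0 / (1.0 + math.exp(-gap))

def pairwise_nll(reward_fn, traj_pref, traj_rej):
    """Negative log-likelihood of one human label saying traj_pref > traj_rej;
    minimizing this over labelled pairs fits the reward model."""
    return -math.log(preference_prob(reward_fn, traj_pref, traj_rej))

# Toy example: a hand-written reward that prefers the "good" action.
reward_fn = lambda s, a: 1.0 if a == "good" else -1.0
traj_pref = [(0, "good"), (1, "good")]
traj_rej = [(0, "bad"), (1, "good")]
print(round(pairwise_nll(reward_fn, traj_pref, traj_rej), 4))  # small loss: label agrees
```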
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.