Related papers: Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based Exploration

Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based Exploration

URL: http://arxiv.org/abs/2502.00666v2
Date: Sun, 09 Feb 2025 20:16:15 GMT
Title: Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based Exploration
Authors: Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang,
Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment.<n>This paper studies the setting of online RLHF and focus on improving sample efficiency.
Score: 20.76451379043945
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focus on improving sample efficiency. All existing algorithms in online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g. questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024).. Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.

Related papers

Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits. We decompose the RLHF process into two distinct stages: (post-)training and deployment. We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
Online Preference Alignment for Language Models via Count-based Exploration [46.46627519343809]
Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences.<n>Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage.<n>Online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs.
arXiv Detail & Related papers (2025-01-22T09:12:09Z)
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning [42.33815055388433]
ARLBench is a benchmark for hyperparameter optimization (HPO) in reinforcement learning (RL) It allows comparisons of diverse HPO approaches while being highly efficient in evaluation. ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL.
arXiv Detail & Related papers (2024-09-27T15:22:28Z)
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [15.038210624870656]
Reward inference is a critical intermediate step in the Reinforcement Learning from Human Feedback pipeline. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDP bandit, and general preference models beyond the Bradley-Terry model.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO) XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond [29.449446595110643]
We propose a notion of data diversity that subsumes the previous notions of coverage measures in offline RL. Our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.
arXiv Detail & Related papers (2024-01-06T20:52:04Z)
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF) We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration. We show that this separation does not exist in the setting of linear MDPs. We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. We tackle this problem under the context of function approximation, leveraging powerful function approximators. We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.