Active Human Feedback Collection via Neural Contextual Dueling Bandits
- URL: http://arxiv.org/abs/2504.12016v1
- Date: Wed, 16 Apr 2025 12:16:10 GMT
- Title: Active Human Feedback Collection via Neural Contextual Dueling Bandits
- Authors: Arun Verma, Xiaoqiang Lin, Zhongxiang Dai, Daniela Rus, Bryan Kian Hsiang Low
- Abstract summary: We propose Neural-ADB, an algorithm for collecting human preference feedback when the underlying latent reward function is non-linear. We show that when preference feedback follows the Bradley-Terry-Luce model, the worst sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the size of the preference dataset increases.
- Score: 84.7608942821423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Collecting human preference feedback is often expensive, leading recent works to develop principled algorithms for selecting feedback queries more efficiently. However, these works assume that the underlying reward function is linear, an assumption that does not hold in many real-life applications, such as online recommendation and LLM alignment. To address this limitation, we propose Neural-ADB, an algorithm based on the neural contextual dueling bandit framework that provides a principled and practical method for collecting human preference feedback when the underlying latent reward function is non-linear. We theoretically show that when preference feedback follows the Bradley-Terry-Luce model, the worst sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the size of the preference dataset increases. Our experimental results on problem instances derived from synthetic preference datasets further validate the effectiveness of Neural-ADB.
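To make the setup concrete, here is a minimal, self-contained sketch of the kind of loop the abstract describes: a neural network fit to Bradley-Terry-Luce (BTL) preference feedback, with the next duel chosen actively. The acquisition rule used here (query the duel whose predicted win probability is closest to 1/2), the synthetic latent reward, and all hyperparameters are illustrative assumptions, not Neural-ADB's exact procedure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, n_pool, rounds = 8, 200, 100
pool = torch.randn(n_pool, d)                 # candidate arms (feature vectors)
w_star = torch.randn(d)

def latent_reward(x):                         # hidden NON-LINEAR reward (synthetic)
    return torch.tanh(x @ w_star)

net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
winners, losers = [], []

for t in range(rounds):
    with torch.no_grad():
        r = net(pool).squeeze(-1)             # current reward estimates
    i = int(torch.argmax(r))                  # incumbent: current best guess
    gap = (torch.sigmoid(r - r[i]) - 0.5).abs()
    gap[i] = float("inf")                     # exclude the incumbent itself
    j = int(torch.argmin(gap))                # challenger: most uncertain duel
    # Simulate BTL feedback from the hidden reward.
    p_j_wins = torch.sigmoid(latent_reward(pool[j]) - latent_reward(pool[i]))
    if torch.rand(()) < p_j_wins:
        winners.append(pool[j]); losers.append(pool[i])
    else:
        winners.append(pool[i]); losers.append(pool[j])
    # One gradient step on the pairwise logistic (BTL) log-likelihood.
    w, l = torch.stack(winners), torch.stack(losers)
    loss = -nn.functional.logsigmoid(net(w) - net(l)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

best_arm = pool[int(torch.argmax(net(pool)))]  # learned greedy policy
```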
Related papers
- Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback [46.7385883375784]
Extragradient preference optimization (EGPO) is an algorithm for Nash learning from human feedback (NLHF) that achieves last-iterate linear convergence to the Nash equilibrium of KL-regularized games. Our empirical evaluations demonstrate EGPO's superior performance over baseline methods when training for the same number of epochs.
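The extragradient update itself is standard; below is a minimal sketch on a toy bilinear min-max game (illustrative only, not the paper's KL-regularized NLHF objective), where plain gradient descent-ascent would spiral but extragradient attains last-iterate convergence.

```python
# Extragradient on min_x max_y x*y: equilibrium at (0, 0).
eta = 0.1
x, y = 1.0, 1.0
for t in range(1000):
    # Half step: probe the gradient field from the current point.
    gx, gy = y, x                      # grad_x(x*y) = y, grad_y(x*y) = x
    xh, yh = x - eta * gx, y + eta * gy
    # Full step: move from the ORIGINAL point using the half-step gradient.
    gxh, gyh = yh, xh
    x, y = x - eta * gxh, y + eta * gyh

print(x, y)  # both shrink toward 0; plain descent-ascent would not converge
```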
arXiv Detail & Related papers (2025-03-11T22:44:54Z)
- Active Learning for Direct Preference Optimization [59.84525302418018]
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline.
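For reference, the standard DPO objective such a framework builds on fits in a few lines; the acquisition rule sketched below (query the pair whose implicit reward margin is nearest zero) is an illustrative stand-in, not necessarily the paper's criterion.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss from per-response log-probabilities under the
    policy (logp_*) and the frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def select_query(logp_a, logp_b, ref_a, ref_b, beta=0.1):
    """Hypothetical acquisition: ask the human about the pair whose
    implicit reward margin is closest to zero (most uncertain)."""
    margin = beta * ((logp_a - ref_a) - (logp_b - ref_b))
    return int(torch.argmin(margin.abs()))

# Toy usage on random per-pair log-probabilities.
lp = torch.randn(4, 16)
print(select_query(lp[0], lp[1], lp[2], lp[3]))
```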
arXiv Detail & Related papers (2025-03-03T00:36:31Z)
- Best Policy Learning from Trajectory Preference Feedback [15.799929216215672]
We address the problem of best policy identification in preference-based reinforcement learning (PbRL). We propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret.
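The top-two idea is easy to state in code. Below is a minimal sketch on a Gaussian bandit; this is illustrative only, since the paper operates on trajectory preferences and $\mathsf{PSPL}$'s details differ.

```python
import numpy as np

rng = np.random.default_rng(0)
K, beta_tt = 5, 0.5                 # arms; probability of playing the leader
true_mu = rng.normal(size=K)
s, n = np.zeros(K), np.zeros(K)     # reward sums / pull counts per arm

def posterior_sample():
    # N(0, 1) prior, unit noise => posterior N(s/(n+1), 1/(n+1)) per arm.
    return rng.normal(s / (n + 1), 1.0 / np.sqrt(n + 1))

for t in range(500):
    leader = int(np.argmax(posterior_sample()))
    arm = leader
    if rng.random() > beta_tt:      # otherwise: re-sample until a DISTINCT
        while arm == leader:        # arm tops the draw -- the "challenger"
            arm = int(np.argmax(posterior_sample()))
    reward = true_mu[arm] + rng.normal()
    s[arm] += reward; n[arm] += 1

print("guess:", int(np.argmax(s / (n + 1))), "truth:", int(np.argmax(true_mu)))
```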
arXiv Detail & Related papers (2025-01-31T03:55:10Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We show that our approach consistently boosts DPO by a considerable margin.
Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
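One way to picture reward conditioning, as a loudly hypothetical sketch (the tagging template below is an assumption, not the paper's exact format): keep both responses of a preference pair as training targets, each prefixed with its own quality score, so that low-quality responses teach the model rather than get unlearned.

```python
def reward_augment(prompt, chosen, rejected, r_chosen, r_rejected):
    """Hypothetical reward-conditioned augmentation: BOTH responses become
    training examples, each conditioned on its own quality score."""
    return [
        {"prompt": f"<reward: {r_chosen:.1f}> {prompt}", "completion": chosen},
        {"prompt": f"<reward: {r_rejected:.1f}> {prompt}", "completion": rejected},
    ]

print(reward_augment("Explain DPO.", "clear answer", "vague answer", 0.9, 0.2))
```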
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Neural Dueling Bandits: Preference-Based Optimization with Human Feedback [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms. We also extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback [6.578074497549894]
Inverse Reinforcement Learning (IRL) and Reinforcement Learning from Human Feedback (RLHF) are pivotal methodologies in reward learning.
This paper introduces a novel linear programming (LP) framework tailored for offline reward learning.
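As a hedged illustration of the general idea (not the paper's exact program), pairwise preferences over a small tabular state space can be encoded as linear constraints on the reward, giving a max-margin LP:

```python
import numpy as np
from scipy.optimize import linprog

prefs = [(0, 1), (1, 2), (0, 3)]       # (preferred state, dispreferred state)
S = 4                                   # tabular reward over S states (assumed)

# Variables: r_0..r_{S-1}, m.  Maximize m  s.t.  r_w - r_l >= m,  |r| <= 1.
c = np.zeros(S + 1); c[-1] = -1.0       # linprog minimizes, so minimize -m
A_ub, b_ub = [], []
for w, l in prefs:
    row = np.zeros(S + 1)
    row[w], row[l], row[-1] = -1.0, 1.0, 1.0   # -r_w + r_l + m <= 0
    A_ub.append(row); b_ub.append(0.0)
bounds = [(-1, 1)] * S + [(0, None)]
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
print("reward:", res.x[:S], "margin:", res.x[-1])
```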
arXiv Detail & Related papers (2024-05-20T23:59:26Z)
- Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy.
We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound.
Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
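One illustrative way to "choose contexts" (an assumed acquisition, not the paper's stated rule) is to query where an ensemble of reward models disagrees most about which response wins:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, d, m = 50, 4, 8
a = rng.normal(size=(n_ctx, d))     # response A features, one per context
b = rng.normal(size=(n_ctx, d))     # response B features, one per context
ensemble = rng.normal(size=(m, d))  # stand-in for m fitted linear reward models

# BTL win probability of A over B under each ensemble member: sigma((a-b)@w).
logits = (a - b) @ ensemble.T       # shape (n_ctx, m)
probs = 1.0 / (1.0 + np.exp(-logits))
query = int(np.argmax(probs.var(axis=1)))   # most-disputed context
print("query context", query)
```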
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization [42.865641215856925]
We propose a provably efficient offline contextual bandit with neural network function approximation.
We show that our method generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works.
We also demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
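The pessimism principle itself is easy to sketch: act on a lower confidence bound, i.e., the mean reward estimate minus an uncertainty penalty. The ensemble-spread penalty below is an illustrative proxy, not the paper's neural-network-based confidence construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, beta = 100, 6, 10, 1.0
X = rng.normal(size=(n, d))          # candidate (context, action) features
models = rng.normal(size=(m, d))     # stand-in for m reward models fit offline

preds = X @ models.T                 # (n, m) predicted rewards
lcb = preds.mean(axis=1) - beta * preds.std(axis=1)   # pessimistic lower bound
print("chosen index:", int(np.argmax(lcb)))           # best action under LCB
```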
arXiv Detail & Related papers (2021-11-27T03:57:13Z)