Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design
- URL: http://arxiv.org/abs/2506.09508v1
- Date: Wed, 11 Jun 2025 08:27:16 GMT
- Title: Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design
- Authors: Andreas Schlaginhaufen, Reda Ouhamma, Maryam Kamgarpour
- Abstract summary: We study reinforcement learning from human feedback in general Markov decision processes. A central challenge is to design algorithms that select informative preference queries to identify the underlying reward. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches.
- Score: 11.313040194648828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in practical deployment as feedback can be gathered concurrently. Empirical evaluation confirms that the proposed method is competitive with reward-based reinforcement learning while requiring a small number of preference queries.
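The abstract outlines two components: randomized exploration to gather candidate trajectory pairs, and optimal experimental design to decide which pairs to actually query. As a rough illustration of the second component, the following is a minimal sketch of greedy D-optimal selection over a batch of candidate pairs, assuming a linear reward model r(tau) = <theta, phi(tau)> over trajectory features. The function names, the D-optimality criterion, and the dimensions are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch (not the paper's algorithm): given a batch of
# candidate trajectory pairs, greedily select the comparison queries
# that maximize log det of the information matrix of feature
# differences, i.e., a D-optimal experimental design.
import numpy as np

def greedy_d_optimal(diffs, budget, reg=1e-3):
    """Pick `budget` rows of `diffs` (one row per candidate pair,
    holding the feature difference phi(tau_a) - phi(tau_b)) so as to
    greedily maximize log det(A), A = reg*I + sum_selected x x^T."""
    n, d = diffs.shape
    A = reg * np.eye(d)
    chosen = []
    for _ in range(budget):
        A_inv = np.linalg.inv(A)
        # Adding row x increases log det(A) by log(1 + x^T A^{-1} x),
        # so the greedy step picks the largest quadratic form.
        gains = np.einsum("ij,jk,ik->i", diffs, A_inv, diffs)
        if chosen:
            gains[chosen] = -np.inf  # never re-select a pair
        best = int(np.argmax(gains))
        chosen.append(best)
        A += np.outer(diffs[best], diffs[best])
    return chosen

# Toy usage: 200 candidate trajectory pairs with 8-dimensional features.
rng = np.random.default_rng(0)
candidate_diffs = rng.normal(size=(200, 8))
queries = greedy_d_optimal(candidate_diffs, budget=10)
print("indices of pairs to send to human labelers:", queries)
```

Because all selected pairs are chosen up front, the corresponding preference queries can be posed to labelers concurrently, which is the parallelization benefit the abstract highlights; in the paper's randomized-exploration scheme, the candidate trajectories themselves would come from planning under perturbed reward estimates.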
Related papers
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z) - A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts. With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS). Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z) - Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds [1.6385815610837167]
In this work, we introduce innovative learning-rate-free algorithms for stochastic optimization over Riemannian manifolds.
We establish high probability convergence guarantees that are optimal, up to logarithmic factors, compared to the best-known optimally tuned rate in the deterministic setting.
Our approach is validated through numerical experiments, demonstrating competitive performance against learning-rate-dependent algorithms.
arXiv Detail & Related papers (2024-06-04T13:17:24Z) - Sample-Efficient "Clustering and Conquer" Procedures for Parallel Large-Scale Ranking and Selection [0.0]
We modify the commonly used "divide and conquer" framework in parallel computing by adding a correlation-based clustering step. This seemingly simple modification achieves the optimal sample complexity reduction for a widely used class of efficient large-scale R&S procedures. In large-scale AI applications such as neural architecture search, our methods demonstrate superior performance.
arXiv Detail & Related papers (2024-02-03T15:56:03Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Correlation Clustering with Active Learning of Pairwise Similarities [3.86170450233149]
Correlation clustering is a well-known unsupervised learning setting that deals with positive and negative pairwise similarities.
In this paper, we study the case where the pairwise similarities are not given in advance and must be queried in a cost-efficient way.
We develop a generic active learning framework for this task that benefits from several advantages.
arXiv Detail & Related papers (2023-02-20T20:39:07Z) - On the Complexity of Adversarial Decision Making [101.14158787665252]
We show that the Decision-Estimation Coefficient is necessary and sufficient to obtain low regret for adversarial decision making.
We provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures.
arXiv Detail & Related papers (2022-06-27T06:20:37Z) - Large-Scale Sequential Learning for Recommender and Engineering Systems [91.3755431537592]
In this thesis, we focus on the design of automatic algorithms that provide personalized ranking by adapting to the current conditions.
For the former, we propose a novel algorithm called SAROS that takes into account both kinds of feedback for learning over the sequence of interactions.
The proposed idea of taking neighbouring lines into account shows statistically significant improvements over the initial approach for fault detection in power grids.
arXiv Detail & Related papers (2022-05-13T21:09:41Z) - Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits [36.37578212532926]
We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance.
Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy.
We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms.
arXiv Detail & Related papers (2021-06-28T17:37:49Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Outlier Detection Ensemble with Embedded Feature Selection [42.8338013000469]
We propose an outlier detection ensemble framework with embedded feature selection (ODEFS).
For each random sub-sampling based learning component, ODEFS unifies feature selection and outlier detection into a pairwise ranking formulation.
We adopt thresholded self-paced learning to simultaneously optimize feature selection and example selection.
arXiv Detail & Related papers (2020-01-15T13:14:10Z)