Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2509.26605v2
- Date: Mon, 13 Oct 2025 13:00:55 GMT
- Title: Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
- Authors: Maël Macuglia, Paul Friedrich, Giorgia Ramponi
- Abstract summary: We propose a two-stage framework that learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL.
- Score: 8.657536710294766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.
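A minimal sketch of the two-stage recipe described in the abstract, assuming a discrete-action policy, a Bradley-Terry model for the pairwise preference feedback, and a placeholder weight `lam` standing in for the paper's uncertainty weighting (whose exact form the abstract does not give); this illustrates the idea, not BRIDGE itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscretePolicy(nn.Module):
    """Small policy network producing logits over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def bc_loss(policy: DiscretePolicy, obs: torch.Tensor,
            expert_actions: torch.Tensor) -> torch.Tensor:
    """Stage 1: behavioral cloning on reward-free expert demonstrations."""
    return F.cross_entropy(policy(obs), expert_actions)

def preference_loss(return_a: torch.Tensor, return_b: torch.Tensor,
                    prefers_a: torch.Tensor) -> torch.Tensor:
    """Stage 2: Bradley-Terry loss on pairwise trajectory preferences, where
    return_* are learned (reward-model) scores of the two trajectories."""
    return F.binary_cross_entropy_with_logits(return_a - return_b, prefers_a)

def fine_tune_step(policy, optimizer, online_loss, obs, expert_actions, lam):
    """Illustrative combination: a preference-driven online loss plus a cloning
    term, mixed by the placeholder weight `lam` (not the paper's exact rule)."""
    loss = online_loss + lam * bc_loss(policy, obs, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, the offline stage would minimize `bc_loss` alone; online fine-tuning then mixes a preference-driven objective with the cloning term via `lam`.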
Related papers
- Offline Behavioral Data Selection [58.116300485427764]
We show that policy performance rapidly saturates when trained on a small fraction of the dataset. We propose a simple yet effective method, Stepwise Dual Ranking (SDR), which extracts a compact yet informative subset from large-scale offline behavioral datasets. Extensive experiments and ablation studies on D4RL benchmarks demonstrate that SDR significantly improves offline behavioral data selection.
arXiv Detail & Related papers (2025-12-20T07:10:58Z)
- Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach [2.9690567171043725]
This study centers on how to learn and apply value envelopes in this offline-to-online setting. We introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms.
arXiv Detail & Related papers (2025-10-22T12:32:52Z)
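The entry above only states that offline data yields upper and lower value bounds that are then fed to an online algorithm; the toy tabular sketch below shows one plausible way such envelopes could shape online learning (clipping the bootstrap target). The names and the clipping rule are illustrative assumptions, not the paper's method.

```python
import numpy as np

def envelope_shaped_q_update(Q, V_lo, V_hi, s, a, r, s_next,
                             gamma: float = 0.99, alpha: float = 0.1):
    """One tabular Q-learning backup whose bootstrap target is clipped into
    the envelope [r + gamma*V_lo(s'), r + gamma*V_hi(s')] learned offline,
    so online bootstrapping cannot leave the offline-certified region."""
    target = r + gamma * np.max(Q[s_next])
    target = np.clip(target, r + gamma * V_lo[s_next], r + gamma * V_hi[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```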
- Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning [8.087699764574788]
We propose APPO, an efficient algorithm for offline preference-based reinforcement learning (PbRL). APPO guarantees sample complexity bounds without relying on explicit confidence sets. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability.
arXiv Detail & Related papers (2025-03-07T10:35:01Z)
- Offline Learning for Combinatorial Multi-armed Bandits [56.96242764723241]
Off-CMAB is the first offline learning framework for CMAB. Off-CMAB combines pessimistic reward estimations with combinatorial solvers. Experiments on synthetic and real-world datasets highlight the superior performance of the proposed CLCB algorithm.
arXiv Detail & Related papers (2025-01-31T16:56:18Z)
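A brief sketch of the pessimism-plus-solver pattern named in the Off-CMAB entry above: lower-confidence-bound estimates of base-arm means are handed to a combinatorial solver. The Hoeffding-style bonus and the top-k stand-in solver are generic choices, not necessarily CLCB's.

```python
import numpy as np

def per_arm_lcb(counts, reward_sums, delta: float = 0.05):
    """Pessimistic (lower-confidence-bound) estimates of base-arm means from
    logged data: empirical mean minus a Hoeffding-style bonus."""
    counts = np.maximum(counts, 1)
    means = reward_sums / counts
    bonus = np.sqrt(np.log(2 * len(counts) / delta) / (2 * counts))
    return means - bonus

def top_k_solver(arm_values, k: int):
    """Stand-in combinatorial solver: pick the k arms with the largest
    pessimistic values (a real CMAB would call a problem-specific oracle)."""
    return np.argsort(arm_values)[-k:]

# Usage: super_arm = top_k_solver(per_arm_lcb(counts, reward_sums), k=3)
```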
- Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
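As a reference point for the entry above, here is a generic distributionally robust backup using the standard KL-ball dual; it illustrates how DRL-style worst-case estimates can counteract distribution shift, and is not claimed to be the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_expected_value(p_hat, v_next, rho: float):
    """Worst-case E[V(s')] over all distributions within KL radius `rho` of the
    empirical model `p_hat`, via the standard dual:
        inf_Q E_Q[V] = sup_{beta>0} -beta*log E_P[exp(-V/beta)] - beta*rho."""
    v_min = v_next.min()
    def negative_dual(beta):
        # log-sum-exp shift by v_min for numerical stability
        log_mgf = np.log(np.dot(p_hat, np.exp(-(v_next - v_min) / beta)))
        return -(v_min - beta * log_mgf - beta * rho)
    res = minimize_scalar(negative_dual, bounds=(1e-3, 1e3), method="bounded")
    return -res.fun

def robust_backup(p_hat, r, v_next, rho, gamma: float = 0.99):
    """Distributionally robust Bellman backup for one (s, a) pair."""
    return r + gamma * robust_expected_value(p_hat, v_next, rho)
```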
- Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning [34.66035026036424]
We aim to understand why bisimulation methods succeed in online settings, but falter in offline tasks.
Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle.
We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR.
arXiv Detail & Related papers (2023-10-26T04:20:55Z)
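For context on what a bisimulation-based representation objective looks like, below is a generic sampled loss in the spirit of MICo/SimSR; the actual losses and the paper's recommended fixes differ, so treat this purely as an illustration of why missing transitions hurt the bootstrapped term.

```python
import torch
import torch.nn.functional as F

def sampled_bisim_loss(encoder, s_i, s_j, r_i, r_j, s_i_next, s_j_next,
                       gamma: float = 0.99):
    """Regress the embedding distance of a state pair toward the reward
    difference plus the discounted distance of sampled next states. With
    offline data, the next-state term relies on transitions that may simply
    be missing from the dataset, the failure mode analyzed above."""
    z_i, z_j = encoder(s_i), encoder(s_j)
    online_dist = (z_i - z_j).norm(dim=-1)
    with torch.no_grad():  # bootstrapped target, no gradient through it
        next_dist = (encoder(s_i_next) - encoder(s_j_next)).norm(dim=-1)
        target = (r_i - r_j).abs() + gamma * next_dist
    return F.mse_loss(online_dist, target)
```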
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
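The summary above does not spell out Double Policy Estimation, so as a hedged illustration of variance reduction in off-policy evaluation, here is the classical doubly robust estimator; this is a standard construction and should not be read as the paper's DPE.

```python
import numpy as np

def doubly_robust_value(rewards, ratios, q_hat, v_hat, gamma: float = 0.99):
    """Classical per-trajectory doubly robust OPE estimate.
    rewards[t] : observed reward at step t
    ratios[t]  : importance ratio pi_target(a_t|s_t) / pi_behavior(a_t|s_t)
    q_hat[t]   : model estimate of Q^pi(s_t, a_t)
    v_hat[t]   : model estimate of V^pi(s_t)
    The value-function baseline cancels much of the importance-weight variance."""
    value, w = 0.0, 1.0  # w accumulates the product of ratios up to step t-1
    for t in range(len(rewards)):
        value += (gamma ** t) * (w * v_hat[t]
                                 + w * ratios[t] * (rewards[t] - q_hat[t]))
        w *= ratios[t]
    return value
```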
- A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning [25.123237633748193]
Offline-to-online reinforcement learning can be challenging due to constrained exploratory behavior and state-action distribution shift.
We propose a Simple Unified uNcertainty-Guided (SUNG) framework, which unifies the solution to both challenges with the tool of uncertainty.
SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods.
arXiv Detail & Related papers (2023-06-13T05:22:26Z)
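A rough sketch of uncertainty-guided action selection in the spirit of the entry above, using ensemble disagreement as the uncertainty signal; the ensemble proxy and the thresholding rule are assumptions, not SUNG's actual mechanism.

```python
import torch

def ensemble_q_stats(q_ensemble, obs, actions):
    """Mean and standard deviation across an ensemble of Q-functions; the std
    serves as a simple epistemic-uncertainty proxy."""
    qs = torch.stack([q(obs, actions) for q in q_ensemble], dim=0)
    return qs.mean(dim=0), qs.std(dim=0)

def uncertainty_filtered_action(q_ensemble, obs, candidate_actions, max_std):
    """Pick the highest-value candidate action whose uncertainty stays below a
    threshold, so online fine-tuning avoids far out-of-distribution actions.
    obs: (1, obs_dim); candidate_actions: (N, act_dim)."""
    obs_batch = obs.expand(candidate_actions.shape[0], -1)
    q_mean, q_std = ensemble_q_stats(q_ensemble, obs_batch, candidate_actions)
    q_mean = q_mean.masked_fill(q_std > max_std, float("-inf"))
    return candidate_actions[q_mean.argmax()]
```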
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
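To make the quantities in the entry above concrete, the block below gives standard textbook-style definitions of a Bradley-Terry preference model over trajectory-level rewards and of per-trajectory concentrability, which the summary says upper bounds the paper's new single-policy coefficient; the paper's own definitions may differ in detail.

```latex
% Illustrative, standard definitions (not necessarily the paper's exact ones).
\[
  \Pr\!\left(\tau^{1} \succ \tau^{0}\right)
    = \frac{\exp\!\bigl(r(\tau^{1})\bigr)}
           {\exp\!\bigl(r(\tau^{0})\bigr) + \exp\!\bigl(r(\tau^{1})\bigr)},
  \qquad
  C_{\mathrm{traj}}(\pi)
    = \sup_{\tau}\frac{d^{\pi}(\tau)}{\mu(\tau)}.
\]
```

Here $r(\tau)$ is a reward defined over the whole trajectory, $d^{\pi}$ is the trajectory distribution of the comparator policy, and $\mu$ is the offline data distribution.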
- Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.