Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2508.15327v3
- Date: Fri, 10 Oct 2025 03:54:40 GMT
- Title: Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
- Authors: Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li
- Abstract summary: We introduce a Search-Based Preference Weighting (SPW) scheme to unify two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and derives stepwise importance weights from their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment.
- Score: 83.64755389431971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
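A minimal sketch of how this kind of similarity-based stepwise weighting could be implemented (an illustration under assumptions, not the authors' code: the Euclidean nearest-neighbor search, softmax-style normalization, and Bradley-Terry-style loss below are placeholder choices):

```python
import numpy as np

def stepwise_weights(segment, expert_pairs, temperature=1.0):
    """For each (state, action) in a preference-labeled segment, search the
    expert demonstrations for the most similar pair and turn that similarity
    into a per-step importance weight. Distance metric and normalization are
    illustrative assumptions, not necessarily the paper's exact recipe."""
    raw = []
    for s, a in segment:
        query = np.concatenate([s, a])
        # nearest-neighbor search over flattened expert (state, action) pairs
        dists = np.linalg.norm(expert_pairs - query, axis=1)
        raw.append(-dists.min())       # higher = closer to expert behavior
    raw = np.asarray(raw) / temperature
    w = np.exp(raw - raw.max())        # stabilize before normalizing
    return w / w.sum() * len(w)        # weights average to 1 per step

def weighted_preference_loss(r_hat_1, r_hat_2, w1, w2, label):
    """Bradley-Terry-style preference loss in which each step's predicted
    reward is scaled by its importance weight; label = 1 if segment 1 wins."""
    score1, score2 = np.sum(w1 * r_hat_1), np.sum(w2 * r_hat_2)
    p1 = 1.0 / (1.0 + np.exp(score2 - score1))
    return -(label * np.log(p1 + 1e-8) + (1 - label) * np.log(1 - p1 + 1e-8))
```

In this sketch the weights simply rescale each step's contribution to the segment score before the usual preference loss is applied, which is one plausible way to push credit toward expert-like transitions.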
Related papers
- Rectifying Shortcut Behaviors in Preference-based Reward Learning [46.09046818725698]
In reinforcement learning, preference-based reward models play a central role in aligning large language models to human-aligned behavior. Recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. We introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning.
arXiv Detail & Related papers (2025-10-21T20:08:32Z)
- Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning [6.621247723203913]
Similarity as Reward Alignment (SARA) is a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats and training paradigms. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. We demonstrate strong performance compared to baselines on continuous control offline RL benchmarks.
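As a rough illustration of the reward-as-similarity idea described above (a sketch under assumptions: SARA trains its latent encoder contrastively, whereas here a transition embedding and a prototype of preferred samples are taken as given, and cosine similarity is a placeholder metric):

```python
import numpy as np

def similarity_reward(transition_embedding, preferred_prototype):
    """Score a transition by its cosine similarity to a latent prototype of
    preferred samples; higher similarity acts as a higher surrogate reward."""
    e = transition_embedding / (np.linalg.norm(transition_embedding) + 1e-8)
    p = preferred_prototype / (np.linalg.norm(preferred_prototype) + 1e-8)
    return float(e @ p)
```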
arXiv Detail & Related papers (2025-06-14T15:01:59Z)
- CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries [13.06534916144093]
We propose Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY). CLARIFY learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart. Our approach not only selects more distinguishable queries but also learns meaningful trajectory embeddings.
arXiv Detail & Related papers (2025-05-31T04:37:07Z)
- Variational Bayesian Personalized Ranking [39.24591060825056]
Variational BPR is a novel and easily implementable learning objective that integrates likelihood optimization, noise reduction, and popularity debiasing. We introduce an attention-based latent interest prototype contrastive mechanism, replacing instance-level contrastive learning, to effectively reduce noise from problematic samples. Empirically, we demonstrate the effectiveness of Variational BPR on popular backbone recommendation models.
arXiv Detail & Related papers (2025-03-14T04:22:01Z)
- A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts. With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS). Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
- Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences [12.775486996512434]
Preference-based reinforcement learning (PBRL) learns directly from the preferences of human teachers regarding agent behaviors.
Existing PBRL methods often learn from explicit preferences, neglecting the possibility that teachers may choose equal preferences.
We propose a novel PBRL method, Multi-Type Preference Learning (MTPL), which allows simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences.
arXiv Detail & Related papers (2024-09-11T13:43:49Z)
- Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
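A hedged sketch of an uncertainty-driven bonus of this kind, using disagreement across an ensemble of learned reward models (the callable interface and the scaling factor `beta` are illustrative assumptions):

```python
import numpy as np

def exploration_bonus(state, action, reward_ensemble, beta=0.05):
    """Intrinsic reward from the standard deviation of an ensemble of learned
    reward models: high disagreement flags novel state-action pairs to explore.
    `reward_ensemble` is any sequence of callables r_i(state, action) -> float."""
    preds = np.array([r(state, action) for r in reward_ensemble])
    return beta * float(preds.std())
```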
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Just Label What You Need: Fine-Grained Active Selection for Perception and Prediction through Partially Labeled Scenes [78.23907801786827]
We introduce generalizations that make our approach both cost-aware and capable of fine-grained selection of examples through partially labeled scenes.
Our experiments on a real-world, large-scale self-driving dataset suggest that fine-grained selection can improve the performance across perception, prediction, and downstream planning tasks.
arXiv Detail & Related papers (2021-04-08T17:57:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.