Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning
- URL: http://arxiv.org/abs/2506.12529v1
- Date: Sat, 14 Jun 2025 15:01:59 GMT
- Title: Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning
- Authors: Sara Rajaram, R. James Cotton, Fabian H. Sinz
- Abstract summary: Similarity as Reward Alignment (SARA) is a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats and training paradigms. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. We demonstrate strong performance compared to baselines on continuous control offline RL benchmarks.
- Score: 6.621247723203913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference-based Reinforcement Learning (PbRL) entails a variety of approaches for aligning models with human intent to alleviate the burden of reward engineering. However, most previous PbRL work has not investigated the robustness to labeler errors, inevitable with labelers who are non-experts or operate under time constraints. Additionally, PbRL algorithms often target very specific settings (e.g. pairwise ranked preferences or purely offline learning). We introduce Similarity as Reward Alignment (SARA), a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats and training paradigms. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. We demonstrate strong performance compared to baselines on continuous control offline RL benchmarks. We further demonstrate SARA's versatility in applications such as trajectory filtering for downstream tasks, cross-task preference transfer, and reward shaping in online learning.
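The core mechanism described in the abstract (learn a latent summary of preferred samples, then score new samples by their similarity to it) can be illustrated with a minimal sketch. The encoder architecture, the mean embedding as the preferred-set summary, and cosine similarity as the metric are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the reward-as-similarity idea from the abstract.
# Architecture, similarity metric, and training details are illustrative
# assumptions; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a flattened (state, action) segment to a unit-norm latent vector."""
    def __init__(self, seg_dim: int, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seg_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, seg: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(seg), dim=-1)

def preferred_prototype(encoder: SegmentEncoder, preferred: torch.Tensor) -> torch.Tensor:
    """Summarize the preferred set as a single latent (here: its mean embedding)."""
    with torch.no_grad():
        z = encoder(preferred)              # (N, latent_dim)
    return F.normalize(z.mean(dim=0), dim=-1)

def similarity_reward(encoder: SegmentEncoder, segments: torch.Tensor,
                      prototype: torch.Tensor) -> torch.Tensor:
    """Reward = cosine similarity between a segment's latent and the preferred latent."""
    with torch.no_grad():
        z = encoder(segments)               # (B, latent_dim)
    return z @ prototype                    # (B,), values in [-1, 1]
```

In an offline setting, such a similarity reward could be relabeled onto a dataset before running any standard offline RL algorithm.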
Related papers
- CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries [13.06534916144093]
We propose Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY). CLARIFY learns a trajectory embedding space that incorporates preference information, ensuring that clearly distinguished segments are spaced apart. Our approach not only selects more clearly distinguished queries but also learns meaningful trajectory embeddings.
arXiv Detail & Related papers (2025-05-31T04:37:07Z)
- Binary Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning [5.480108613013526]
We propose a general framework to bridge the gap between reward-based offline RL and preference-based offline RL.
Our key insight is transforming preference feedback into scalar rewards via binary reward labeling (BRL); a minimal sketch of this idea appears after this list.
We empirically test our framework on preference datasets based on the standard D4RL benchmark.
arXiv Detail & Related papers (2024-06-14T23:40:42Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
- RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences [23.414135977983953]
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal.
We present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences.
arXiv Detail & Related papers (2024-02-27T07:03:25Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method to solve this benchmark, using unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
- Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
arXiv Detail & Related papers (2022-07-29T17:29:08Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring novelty based on the learned reward; a minimal sketch of this idea appears after this list.
Our experiments show that the exploration bonus from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
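To make the binary reward labeling idea from the Binary Reward Labeling entry above concrete, here is a minimal sketch: each segment in a preference pair receives a scalar reward label according to which side was preferred, after which any standard reward-based offline RL method applies. The label values (+1 / 0) and the `Segment` data layout are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of binary reward labeling (BRL): convert pairwise preference
# feedback into scalar rewards so that reward-based offline RL can be used.
# The label values (1.0 for preferred, 0.0 otherwise) are an illustrative choice.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    reward_label: float = 0.0   # filled in from preference feedback

def binary_reward_label(pairs: List[Tuple[Segment, Segment, int]]) -> List[Segment]:
    """Each tuple is (segment_a, segment_b, choice), choice=0 if a is preferred, 1 if b."""
    labeled = []
    for seg_a, seg_b, choice in pairs:
        seg_a.reward_label = 1.0 if choice == 0 else 0.0
        seg_b.reward_label = 1.0 - seg_a.reward_label
        labeled.extend([seg_a, seg_b])
    return labeled
```

The labeled segments then form a reward-annotated dataset usable by any offline RL algorithm, which is the bridge between preference-based and reward-based offline RL that the entry describes.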
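The exploration bonus from the "Reward Uncertainty for Exploration" entry above can likewise be sketched as disagreement within an ensemble of learned reward models. The ensemble-standard-deviation form of the bonus and the mixing weight `beta` are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of an uncertainty-based exploration bonus for preference-based RL.
# The ensemble-std bonus and the coefficient `beta` are illustrative assumptions.
import torch
import torch.nn as nn
from typing import List

def make_reward_model(obs_dim: int, act_dim: int) -> nn.Module:
    """One member of the learned reward ensemble."""
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )

def reward_with_exploration_bonus(ensemble: List[nn.Module],
                                  obs: torch.Tensor,
                                  act: torch.Tensor,
                                  beta: float = 0.05) -> torch.Tensor:
    """Mean of the learned reward ensemble plus a bonus for its disagreement."""
    x = torch.cat([obs, act], dim=-1)
    with torch.no_grad():
        preds = torch.stack([m(x).squeeze(-1) for m in ensemble])  # (E, B)
    intrinsic = preds.std(dim=0)   # large where the learned reward is uncertain
    return preds.mean(dim=0) + beta * intrinsic
```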