CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries
- URL: http://arxiv.org/abs/2506.00388v3
- Date: Tue, 10 Jun 2025 13:10:39 GMT
- Title: CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries
- Authors: Ni Mu, Hao Hu, Xiao Hu, Yiqin Yang, Bo Xu, Qing-Shan Jia
- Abstract summary: We propose Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY). CLARIFY learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.
- Score: 13.06534916144093
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, which reduces label efficiency and limits PbRL's real-world applicability. To address this, we propose an offline PbRL method, Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information and keeps clearly distinguished segments far apart, thus facilitating the selection of less ambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in both non-ideal teacher and real human feedback settings. Our approach not only selects more distinguishable queries but also learns meaningful trajectory embeddings.
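The abstract describes two coupled mechanisms: a preference-aware contrastive embedding of trajectory segments, and query selection that favors well-separated pairs. The sketch below shows one way such a scheme could look in PyTorch; the encoder architecture, the margin-based contrastive loss, the top-k selection rule, and the names (TrajEncoder, contrastive_preference_loss, select_queries, margin) are assumptions for illustration, not CLARIFY's actual implementation.

```python
# Illustrative sketch only: the module names, the margin-based contrastive loss, and
# the top-k selection rule below are assumptions based on the abstract, not
# CLARIFY's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajEncoder(nn.Module):
    """Embeds a trajectory segment of shape (batch, steps, obs_dim) into R^emb_dim."""

    def __init__(self, obs_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        return self.net(segment).mean(dim=1)  # mean-pool per-step features over time


def contrastive_preference_loss(z0, z1, clear, margin: float = 1.0):
    """clear = 1.0 where the teacher gave a clear preference, 0.0 where the pair was ambiguous.

    Clearly distinguished pairs are pushed at least `margin` apart; ambiguous pairs
    are pulled together, so embedding distance tracks how distinguishable a pair is.
    """
    dist = F.pairwise_distance(z0, z1)
    return (clear * F.relu(margin - dist).pow(2) + (1.0 - clear) * dist.pow(2)).mean()


def select_queries(encoder, seg0, seg1, k: int = 10):
    """Return indices of the k candidate pairs with the largest embedding distance."""
    with torch.no_grad():
        dist = F.pairwise_distance(encoder(seg0), encoder(seg1))
    return torch.topk(dist, min(k, dist.numel())).indices
```

In a typical offline PbRL pipeline, the labels gathered for the selected queries would then be used both to refine this embedding space and to train the usual reward model.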
Related papers
- PB$^2$: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning [2.0373030742807545]
We identify and address this preference exploration problem through population-based methods. We demonstrate that maintaining a diverse population of agents enables more comprehensive exploration of the preference landscape. This diversity improves reward model learning by generating preference queries with clearly distinguishable behaviors.
arXiv Detail & Related papers (2025-06-16T17:51:33Z)
- Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning [6.621247723203913]
Similarity as Reward Alignment (SARA) is a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats and training paradigms. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. We demonstrate strong performance compared to baselines on continuous control offline RL benchmarks.
arXiv Detail & Related papers (2025-06-14T15:01:59Z)
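The SARA summary above centers on computing rewards as similarity to a latent learned from preferred samples. The minimal sketch below illustrates that "similarity as reward" step; the cosine-similarity choice, the mean-pooled reference latent, and the encoder argument are assumptions for illustration, not SARA's exact formulation.

```python
# Hedged sketch of "similarity as reward": score a sample by the similarity of its
# latent to a reference latent built from preferred samples. Cosine similarity and
# the mean-pooled reference are illustrative choices, not necessarily SARA's own.
import torch
import torch.nn.functional as F


def similarity_reward(encoder, preferred_batch, samples):
    """Reward each sample by cosine similarity to the mean latent of preferred data."""
    with torch.no_grad():
        reference = encoder(preferred_batch).mean(dim=0, keepdim=True)  # (1, emb_dim)
        z = encoder(samples)                                            # (B, emb_dim)
    return F.cosine_similarity(z, reference, dim=-1)                    # (B,) rewards
```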
- Debiasing Online Preference Learning via Preference Feature Preservation [64.55924745257951]
Recent preference learning frameworks simplify human preferences into binary pairwise comparisons and scalar rewards. This can bias large language models' responses toward the most-preferred features, a problem that is exacerbated over iterations of online preference learning. We propose Preference Feature Preservation to maintain the distribution of human preference features and to utilize these rich signals throughout the online preference learning process.
arXiv Detail & Related papers (2025-06-06T13:19:07Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences [12.775486996512434]
Preference-based reinforcement learning (PbRL) learns directly from the preferences of human teachers regarding agent behaviors.
Existing PbRL methods often learn only from explicit preferences, neglecting the possibility that teachers may express equal preferences.
We propose a novel PbRL method, Multi-Type Preference Learning (MTPL), which enables simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences.
arXiv Detail & Related papers (2024-09-11T13:43:49Z)
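The MTPL summary above hinges on using equal preferences alongside explicit ones. One standard way to fold an "equal" judgment into the Bradley-Terry reward-model objective is a soft 0.5/0.5 target, sketched below as a generic illustration; this is not claimed to be MTPL's specific mechanism.

```python
# Generic illustration: Bradley-Terry preference probabilities trained with
# cross-entropy, where an "equal" judgment becomes a soft 0.5/0.5 target.
# This is one standard way to use equal preferences, not MTPL's exact method.
import torch
import torch.nn.functional as F


def preference_loss(r0, r1, label):
    """r0, r1: summed predicted rewards of segments 0 and 1, shape (B,).

    label: tensor with 1.0 if segment 0 is preferred, 0.0 if segment 1 is
    preferred, and 0.5 if the teacher judged the two segments equal.
    """
    p0 = torch.sigmoid(r0 - r1)  # Bradley-Terry probability that segment 0 wins
    return F.binary_cross_entropy(p0, label)
```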
- Preference-Guided Reinforcement Learning for Efficient Exploration [7.83845308102632]
We introduce LOPE: Learning Online with trajectory Preference guidancE, an end-to-end preference-guided RL framework.
Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance.
LOPE outperforms several state-of-the-art methods in terms of both convergence rate and overall performance.
arXiv Detail & Related papers (2024-07-09T02:11:12Z)
- Hindsight Preference Learning for Offline Preference-based Reinforcement Learning [22.870967604847458]
Offline preference-based reinforcement learning (RL) focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset.
We propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments.
Our proposed method, Hindsight Preference Learning (HPL), facilitates credit assignment by taking full advantage of the vast trajectory data available in unlabeled datasets.
arXiv Detail & Related papers (2024-07-05T12:05:37Z)
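The central idea summarized above, rewards conditioned on future outcomes, can be pictured as a reward network that also receives an embedding of what happened after a step. The sketch below is an assumption-laden illustration; the class name, architecture, and argument names are invented here and are not taken from HPL.

```python
# Assumption-laden illustration of a reward model conditioned on a future-outcome
# embedding; the class name, architecture, and argument names are invented for
# this sketch and are not taken from HPL.
import torch
import torch.nn as nn


class FutureConditionedReward(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, future_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + future_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, obs, act, z_future):
        # z_future summarizes what happened after this step (e.g., an encoding of the
        # rest of the trajectory); for downstream policy learning it could instead be
        # drawn from a learned prior, since real futures are unknown at that point.
        return self.net(torch.cat([obs, act, z_future], dim=-1)).squeeze(-1)
```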
- Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning [81.69044784288005]
Iterative preference learning requires online annotated preference labels.
We study strategies for selecting response pairs that are worth annotating, enabling cost-efficient annotation.
arXiv Detail & Related papers (2024-06-25T06:49:16Z)
- Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMControl and Meta-world.
It shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback efficiency and sample efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
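A natural reading of "uncertainty in the learned reward" is disagreement across an ensemble of reward models, sketched below; the ensemble form, the standard-deviation bonus, and the beta weight are illustrative assumptions rather than the paper's exact construction.

```python
# Sketch of an uncertainty-driven exploration bonus: the intrinsic reward is the
# disagreement (standard deviation) across an ensemble of learned reward models.
# The ensemble form and the beta weighting are illustrative assumptions.
import torch


def intrinsic_reward(reward_ensemble, obs, act, beta: float = 0.05):
    """reward_ensemble: list of reward networks, each mapping (obs, act) -> (B,) rewards."""
    with torch.no_grad():
        preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)  # (K, B)
    return beta * preds.std(dim=0)  # higher disagreement -> larger exploration bonus
```

Such a bonus would typically be added to the learned extrinsic reward during policy optimization, encouraging the agent to visit behaviors the reward model is still uncertain about.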
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge for such a benchmark is the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
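To make "teachers with a wide array of irrationalities" concrete, the sketch below shows a simple simulated teacher that labels segment pairs from their true returns but can skip, declare ties, or make mistakes; the parameter names and rules here are illustrative stand-ins, not B-Pref's exact teacher definitions.

```python
# Illustrative simulated teacher in the spirit of "irrational" benchmark teachers:
# it labels a pair of segments from their true returns, but may skip uninformative
# queries, declare near-equal segments a tie, or flip the label by mistake.
# The parameter names and rules here are assumptions, not B-Pref's exact definitions.
import random


def simulated_teacher(return0, return1, error_prob=0.1,
                      skip_threshold=1.0, equal_threshold=0.5):
    """Return 1.0 / 0.0 for a preference, 0.5 for a tie, or None to skip the query."""
    if max(return0, return1) < skip_threshold:
        return None                        # both segments look uninformative
    if abs(return0 - return1) < equal_threshold:
        return 0.5                         # the teacher cannot tell them apart
    label = 1.0 if return0 > return1 else 0.0
    if random.random() < error_prob:
        label = 1.0 - label                # occasional mistaken label
    return label
```

A teacher model of this kind lets a benchmark stress-test preference-based RL algorithms against noisy, withheld, or tied labels without querying real humans.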
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.