Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
- URL: http://arxiv.org/abs/2508.07126v2
- Date: Mon, 06 Oct 2025 20:55:39 GMT
- Title: Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
- Authors: Zhengran Ji, Boyuan Chen
- Abstract summary: We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows. Pref-GUIDE Voting enhances robustness by aggregating reward models across a population of users to form consensus preferences.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
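As a rough illustration of the core idea, the sketch below converts per-segment scalar feedback into preference pairs by comparing segments within a short window and filtering ambiguous comparisons, in the spirit of Pref-GUIDE Individual. All names and parameters (`window`, `margin`) are hypothetical, not the authors' implementation.

```python
def scalar_to_preferences(segments, scores, window=5, margin=0.2):
    """Convert per-segment scalar feedback into preference pairs.

    segments: behavior segments in temporal order.
    scores: scalar human feedback for each segment.
    Returns (i, j) pairs meaning segment i is preferred over segment j.
    Only segments within `window` steps of each other are compared, and
    comparisons whose score gap is below `margin` are treated as
    ambiguous and filtered out.
    """
    prefs = []
    for i in range(len(segments)):
        for j in range(i + 1, min(i + 1 + window, len(segments))):
            gap = scores[i] - scores[j]
            if abs(gap) < margin:  # ambiguous feedback: drop the pair
                continue
            prefs.append((i, j) if gap > 0 else (j, i))
    return prefs
```

The resulting pairs can then train a standard preference-based reward model (e.g. with a Bradley-Terry loss); the windowing limits comparisons to feedback given under similar context, which is what mitigates temporal inconsistency.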
Related papers
- Reward Modeling from Natural Language Human Feedback [77.75758630455357]
Reinforcement Learning with Verifiable Reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). In this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. We propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals.
arXiv Detail & Related papers (2026-01-12T09:23:43Z)
- Rectifying Shortcut Behaviors in Preference-based Reward Learning [46.09046818725698]
In reinforcement learning, preference-based reward models play a central role in aligning large language models with human-preferred behavior. Recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. We introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning.
arXiv Detail & Related papers (2025-10-21T20:08:32Z)
- Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning [83.64755389431971]
We introduce a Search-Based Preference Weighting (SPW) scheme to unify two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations. These weights are then used to guide standard preference learning, enabling more accurate credit assignment.
arXiv Detail & Related papers (2025-08-21T07:41:45Z)
- Learning from Natural Language Feedback for Personalized Question Answering [21.115495457454365]
Personalization is crucial for enhancing the effectiveness and user satisfaction of language technologies. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG). We introduce Vac, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF).
arXiv Detail & Related papers (2025-08-14T14:36:53Z) - Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback.
We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function.
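A potential-based shaping function, as referenced above, adds a shaping term to the environment reward without changing the optimal policy. The minimal sketch below shows the standard form `r + gamma * Phi(s') - Phi(s)`; the function name and the idea that the LLM feedback supplies the potential `Phi` are illustrative assumptions, not this paper's exact formulation.

```python
def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    """Environment reward r plus the potential-based shaping term
    gamma * Phi(s') - Phi(s), where Phi could be derived from
    (possibly error-prone) language model feedback."""
    return r + gamma * phi_s_next - phi_s
```

Because the shaping term telescopes along any trajectory, even noisy potentials bias only the speed of learning, not which policy is optimal.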
arXiv Detail & Related papers (2024-10-22T19:52:08Z) - Online Bandit Learning with Offline Preference Data for Improved RLHF [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
arXiv Detail & Related papers (2024-06-13T20:25:52Z)
- The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this objective mismatch, review relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z)
- Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification [15.453123084827089]
ITERS is an iterative reward shaping approach using human feedback for mitigating the effects of a misspecified reward function.
We evaluate ITERS in three environments and show that it can successfully correct misspecified reward functions.
arXiv Detail & Related papers (2023-08-30T11:45:40Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
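One common way to turn reward uncertainty into an exploration bonus is to reward disagreement across an ensemble of learned reward models; the sketch below is a hedged illustration of that pattern (hypothetical names, not necessarily this paper's exact formulation).

```python
import statistics


def exploration_bonus(ensemble_rewards, beta=1.0):
    """Intrinsic bonus from disagreement across an ensemble of learned
    reward models, all evaluated on the same (state, action) pair.
    High standard deviation marks behaviors the reward models are
    uncertain about, which the agent is then encouraged to visit."""
    return beta * statistics.pstdev(ensemble_rewards)
```

When the ensemble agrees, the bonus vanishes and the agent simply optimizes the learned reward.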
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
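The confidence-based pseudo-labeling step described above can be sketched as follows. This is a hedged illustration with hypothetical names, not the SURF implementation: unlabeled segment pairs receive a pseudo-label only when the current preference predictor is sufficiently confident.

```python
def pseudo_label(pred_probs, threshold=0.9):
    """pred_probs: the preference predictor's P(segment A preferred)
    for each unlabeled pair. Returns (pair_index, label) pseudo-labels
    for confident pairs only, where label 1 means A is preferred and
    0 means B is preferred; uncertain pairs are skipped."""
    labels = []
    for idx, p in enumerate(pred_probs):
        if p >= threshold:
            labels.append((idx, 1))
        elif p <= 1 - threshold:
            labels.append((idx, 0))
    return labels
```

The pseudo-labeled pairs are then mixed with the human-labeled ones for reward learning, which is what makes the method feedback-efficient.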
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference [71.11416263370823]
We propose a generative inverse reinforcement learning approach for user behavioral preference modelling.
Our model can automatically learn rewards from a user's actions based on a discriminative actor-critic network and a Wasserstein GAN.
arXiv Detail & Related papers (2021-05-03T13:14:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.