Related papers: RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

URL: http://arxiv.org/abs/2501.08617v3
Date: Tue, 10 Jun 2025 03:19:33 GMT
Title: RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac,
Abstract summary: We show that Reinforcement Learning from Human Feedback can cause severe, systematic misalignment.<n>We introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback.<n>RLHS substantially improves alignment over RLHF in experiments and human evaluations.
Score: 3.998312409829935
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions--crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings--marketplace interactions, restaurant recommendations, and online course advising--using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, whereas RLHS consistently outperforms baselines and demonstrates robust alignment generalization. The project webpage and code are available at https://rl-hindsight.github.io.

Related papers

Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning [55.36978389831446]
We recast reflective exploration within the Bayes-Adaptive RL framework.<n>Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on observed outcomes.
arXiv Detail & Related papers (2025-05-26T22:51:00Z)
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs) This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
Solving the Inverse Alignment Problem for Efficient RLHF [0.0]
We define the 'inverse alignment problem' in language model training. We investigate whether repeatedly fine-tuning a reward model on subsets of the offline preference dataset aligned with a periodically frozen policy improves upon vanilla RLHF.
arXiv Detail & Related papers (2024-12-13T19:47:38Z)
On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation [47.35777964373532]
Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation.<n>Previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback.<n>We present RLFH, an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior.
arXiv Detail & Related papers (2024-06-18T02:43:49Z)
Strategically Conservative Q-Learning [89.17906766703763]
offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility. The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions. We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning [10.593924216046977]
We first theoretically analyze overestimation phenomenon led by MSE and provide the theoretical upper bound of the overestimated error. At last, we propose the offline RL algorithm based on underestimated operator and diffusion policy model.
arXiv Detail & Related papers (2024-06-05T14:37:42Z)
Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences. Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources. In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named as textitcontrastive rewards
arXiv Detail & Related papers (2024-03-12T14:51:57Z)
Assessing the Impact of Distribution Shift on Reinforcement Learning Performance [0.0]
Reinforcement learning (RL) faces its own set of unique challenges. Comparison of point estimates, and plots that show successful convergence to the optimal policy during training, may obfuscate overfitting or dependence on the experimental setup. We propose a set of evaluation methods that measure the robustness of RL algorithms under distribution shifts.
arXiv Detail & Related papers (2024-02-05T23:50:55Z)
REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods to mitigate this misalignment work by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning [54.48409201256968]
offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of offline pretrained policy using only a few online samples. Most O2O methods focus on the balance between RL objective and pessimism, or the utilization of offline and online samples.
arXiv Detail & Related papers (2023-12-12T19:24:35Z)
Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment [105.34140537748546]
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment.
arXiv Detail & Related papers (2023-11-07T15:36:40Z)
Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling. We show our algorithm achieves $widetildeO(sqrtd3H3 T + d2H2 E[tau]$ worst-case regret in the presence of unknown delays. We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
arXiv Detail & Related papers (2023-10-29T06:12:43Z)
Model-enhanced Contrastive Reinforcement Learning for Sequential Recommendation [28.218427886174506]
We propose a novel RL recommender named model-enhanced contrastive reinforcement learning (MCRL) On the one hand, we learn a value function to estimate the long-term engagement of users, together with a conservative value learning mechanism to alleviate the overestimation problem. Experiments demonstrate that the proposed method significantly outperforms existing offline RL and self-supervised RL methods.
arXiv Detail & Related papers (2023-10-25T11:43:29Z)
Off-Policy Evaluation for Human Feedback [46.82894469763776]
Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL) Existing OPE methods fall short in estimating human feedback (HF) signals. We introduce an OPE for HF (OPEHF) framework that revives existing OPE methods in order to accurately evaluate the HF signals.
arXiv Detail & Related papers (2023-10-11T01:52:42Z)
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training. For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories. We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose negative sampling strategy for training the RL component and combine it with supervised sequential learning. Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case. We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
Causal Inference Q-Network: Toward Resilient Reinforcement Learning [57.96312207429202]
We consider a resilient DRL framework with observational interferences. Under this framework, we propose a causal inference based DRL algorithm called causal inference Q-network (CIQ) Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.
arXiv Detail & Related papers (2021-02-18T23:50:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.