RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
- URL: http://arxiv.org/abs/2501.08617v2
- Date: Mon, 10 Feb 2025 21:17:01 GMT
- Title: RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
- Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
- Abstract summary: We show that Reinforcement Learning from Human Feedback can cause severe, systematic misalignment. We introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We evaluate post-hoc on the TruthfulQA benchmark and find that, even after single-task fine-tuning, both RLHF misalignment and RLHS alignment carry over to substantially different settings.
- Score: 3.998312409829935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. Conversely, our theoretical analysis shows that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions; crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We demonstrate RLHS on online (PPO) and offline (DPO) large language model fine-tuning, obtaining superior alignment over RLHF in controlled consultancy-type experiments and user studies. We evaluate post-hoc on the TruthfulQA benchmark and find that, even after single-task fine-tuning, both RLHF misalignment and RLHS alignment carry over to substantially different settings.
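As a rough illustration of the mechanism described in the abstract, the sketch below contrasts foresight-based RLHF feedback with hindsight-simulation (RLHS) feedback. It is a minimal Python sketch, not the authors' implementation; the generate, simulate_outcome, and elicit_feedback helpers are hypothetical placeholders.

```python
# Minimal sketch of the RLHF vs. RLHS feedback loops described in the abstract.
# All helper methods (generate, simulate_outcome, elicit_feedback) are
# hypothetical placeholders, not the authors' implementation.

def rlhf_feedback(model, evaluator, prompt):
    """Foresight-based feedback: the evaluator must *predict* the downstream
    outcome from the AI's output alone, which the output itself can bias."""
    response = model.generate(prompt)
    return evaluator.elicit_feedback(prompt, response, observed_outcome=None)

def rlhs_feedback(model, evaluator, world_model, prompt):
    """Hindsight-simulation feedback: a plausible downstream outcome is first
    sampled (here from a simulator, possibly the model's own world model) and
    shown to the evaluator before feedback is elicited, decoupling the
    alignment signal from predictions the AI could have compromised."""
    response = model.generate(prompt)
    outcome = world_model.simulate_outcome(prompt, response)  # simulated hindsight
    return evaluator.elicit_feedback(prompt, response, observed_outcome=outcome)
```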
Related papers
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs).
This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld.
We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - Solving the Inverse Alignment Problem for Efficient RLHF [0.0]
We define the 'inverse alignment problem' in language model training.
We investigate whether repeatedly fine-tuning a reward model on subsets of the offline preference dataset aligned with a periodically frozen policy improves upon vanilla RLHF.
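A minimal sketch of that iterative procedure, assuming hypothetical helpers for subset selection, reward-model fine-tuning, and the RLHF policy update; it illustrates the loop structure only, not the paper's algorithm.

```python
# Sketch of the iterative scheme described above: alternate between freezing
# the policy, selecting the preference-data subset most aligned with it, and
# fine-tuning the reward model on that subset before the next RLHF round.
# select_aligned_subset, finetune_reward_model, and rlhf_step are hypothetical.

def iterative_reward_alignment(policy, reward_model, preference_data, rounds=3):
    for _ in range(rounds):
        frozen_policy = policy.copy()                        # periodically frozen policy
        subset = select_aligned_subset(preference_data, frozen_policy)
        reward_model = finetune_reward_model(reward_model, subset)
        policy = rlhf_step(policy, reward_model)             # e.g. a PPO update phase
    return policy, reward_model
```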
arXiv Detail & Related papers (2024-12-13T19:47:38Z) - UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning [10.593924216046977]
We first theoretically analyze the overestimation phenomenon induced by the MSE loss and provide a theoretical upper bound on the overestimation error.
Finally, we propose an offline RL algorithm based on an underestimation operator and a diffusion policy model.
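One generic way to realize an underestimation operator is a pessimistic ensemble minimum in the Bellman target; the sketch below shows that stand-in idea only and is not the UDQL operator itself.

```python
import numpy as np

# Generic illustration of an "underestimated" Bellman target: instead of the
# usual greedy max (prone to overestimation under MSE regression), take a
# pessimistic minimum over an ensemble of Q-estimates. This is NOT the UDQL
# operator, only a stand-in for the underestimation idea.

def underestimated_target(rewards, next_q_ensemble, gamma=0.99):
    """rewards: (batch,); next_q_ensemble: (n_ensemble, batch, n_actions)."""
    greedy_values = next_q_ensemble.max(axis=-1)        # (n_ensemble, batch)
    pessimistic_value = greedy_values.min(axis=0)       # elementwise ensemble min
    return rewards + gamma * pessimistic_value

targets = underestimated_target(np.zeros(4), np.random.randn(5, 4, 3))
```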
arXiv Detail & Related papers (2024-06-05T14:37:42Z) - Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, called contrastive rewards.
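A hedged sketch of one plausible form of such a penalty: offsetting the raw reward-model score by a baseline computed from reference responses to the same prompt. The formula is illustrative, not necessarily the paper's exact definition.

```python
import numpy as np

# Hedged sketch of a penalized ("contrastive") reward: the raw reward-model
# score is offset by a baseline computed from reference responses to the same
# prompt, so noisy absolute scores matter less than relative improvement.
# This illustrates the penalty idea, not the paper's exact formula.

def contrastive_reward(policy_score, baseline_scores, penalty_weight=1.0):
    """policy_score: reward of the policy response; baseline_scores: rewards
    of a few reference (e.g. SFT) responses to the same prompt."""
    return policy_score - penalty_weight * np.mean(baseline_scores)

r = contrastive_reward(0.8, [0.5, 0.6, 0.4])
```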
arXiv Detail & Related papers (2024-03-12T14:51:57Z) - Assessing the Impact of Distribution Shift on Reinforcement Learning Performance [0.0]
Reinforcement learning (RL) faces its own set of unique challenges.
Comparison of point estimates, and plots that show successful convergence to the optimal policy during training, may obfuscate overfitting or dependence on the experimental setup.
We propose a set of evaluation methods that measure the robustness of RL algorithms under distribution shifts.
arXiv Detail & Related papers (2024-02-05T23:50:55Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
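As a rough illustration of reward regularization in RLHF generally (not REBEL's specific scheme), one can blend the learned preference reward with a baseline or task reward so that the policy cannot overoptimize the learned reward alone:

```python
# Generic sketch of reward regularization in RLHF: blend the learned
# preference-based reward with a regularizer (here, a fixed baseline/task
# reward) to curb reward overoptimization. The blending form and weight are
# illustrative, not REBEL's exact scheme.

def regularized_reward(learned_reward, baseline_reward, lam=0.1):
    return (1.0 - lam) * learned_reward + lam * baseline_reward

r = regularized_reward(learned_reward=0.9, baseline_reward=0.2)
```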
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment [105.34140537748546]
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses.
Our approach makes two major contributions. First, we curate a refined alignment dataset that pairs initial responses with the corresponding revised ones.
Second, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment.
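A minimal sketch of what a fine-grained (token-level) alignment loss can look like, assuming per-token weights derived from contrasting the initial and revised responses; the weighting scheme here is illustrative, not FIGA's exact loss.

```python
import numpy as np

# Hedged sketch of a fine-grained (token-level) alignment loss: tokens the
# revision added are up-weighted positively, tokens unique to the initial
# (bad) response are penalized, and unchanged tokens get a neutral weight.
# The exact weighting in FIGA may differ; this only illustrates the idea of
# fine-grained quality signals.

def fine_grained_loss(token_logprobs, token_weights):
    """token_logprobs: per-token log-probs under the policy; token_weights:
    +1 for desirable tokens, -1 for undesirable, 0 for neutral."""
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    token_weights = np.asarray(token_weights, dtype=float)
    return -np.sum(token_weights * token_logprobs)

loss = fine_grained_loss([-0.2, -1.3, -0.7], [+1, 0, -1])
```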
arXiv Detail & Related papers (2023-11-07T15:36:40Z) - Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3 H^3 T} + d^2 H^2 \mathbb{E}[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
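For reference, a generic unadjusted Langevin dynamics sampler of the kind referenced above; the concrete posterior targeted by Delayed-LPSVI is not reproduced here.

```python
import numpy as np

# Generic (unadjusted) Langevin dynamics for approximate posterior sampling:
# a gradient step on the negative log-posterior plus Gaussian noise.
# grad_neg_log_post is the gradient of the negative log-posterior; the exact
# target used by Delayed-LPSVI is not shown here.

def langevin_sample(theta, grad_neg_log_post, step=1e-3, n_steps=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - step * grad_neg_log_post(theta) + np.sqrt(2 * step) * noise
    return theta

# Example: sample from a standard Gaussian, whose negative log-density has gradient x.
sample = langevin_sample(np.zeros(3), grad_neg_log_post=lambda x: x)
```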
arXiv Detail & Related papers (2023-10-29T06:12:43Z) - Model-enhanced Contrastive Reinforcement Learning for Sequential Recommendation [28.218427886174506]
We propose a novel RL recommender named model-enhanced contrastive reinforcement learning (MCRL).
On the one hand, we learn a value function to estimate the long-term engagement of users, together with a conservative value learning mechanism to alleviate the overestimation problem.
Experiments demonstrate that the proposed method significantly outperforms existing offline RL and self-supervised RL methods.
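A CQL-style sketch of a conservative value penalty, shown only to illustrate the "conservative value learning" idea mentioned above; MCRL's exact mechanism may differ.

```python
import numpy as np

# CQL-style sketch of a conservative value penalty: push down the values of
# actions the policy could propose (soft-max over all actions) and push up the
# values of actions actually taken in the logged data, discouraging
# overestimation on unseen actions. Illustrative only, not MCRL's formulation.

def conservative_penalty(q_all_actions, q_dataset_actions, alpha=1.0):
    """q_all_actions: (batch, n_actions) Q-values; q_dataset_actions: (batch,)
    Q-values of the logged actions."""
    logsumexp = np.log(np.exp(q_all_actions).sum(axis=1))   # soft maximum over actions
    return alpha * np.mean(logsumexp - q_dataset_actions)

pen = conservative_penalty(np.random.randn(8, 4), np.random.randn(8))
```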
arXiv Detail & Related papers (2023-10-25T11:43:29Z) - Off-Policy Evaluation for Human Feedback [46.82894469763776]
Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL).
Existing OPE methods fall short in estimating human feedback (HF) signals.
We introduce an OPE for HF (OPEHF) framework that revives existing OPE methods in order to accurately evaluate the HF signals.
arXiv Detail & Related papers (2023-10-11T01:52:42Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving properties of the Q-network during training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
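The DPO objective itself is compact; below is a minimal NumPy version using sequence-level log-probabilities under the policy and the frozen reference model.

```python
import numpy as np

# Standard DPO loss on a batch of preference pairs, written with
# sequence-level log-probabilities of the chosen (y_w) and rejected (y_l)
# responses under the policy and the frozen reference model.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.mean(np.logaddexp(0.0, -logits))   # = mean of -log sigmoid(logits)

loss = dpo_loss(np.array([-12.0]), np.array([-15.0]),
                np.array([-13.0]), np.array([-14.0]))
```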
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
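A minimal sketch of the sampled "advantage over the average case" described above, comparing the Q-value of the positive (clicked) item against the mean Q-value of sampled negatives:

```python
import numpy as np

# Sketch of the "advantage over the average case": the Q-value of the observed
# positive action (item) minus the mean Q-value of sampled negative actions,
# which can then reweight the supervised sequential-learning loss.

def sampled_advantage(q_positive, q_negatives):
    """q_positive: scalar Q of the positive item; q_negatives: Q-values of
    sampled negative items for the same state."""
    return q_positive - float(np.mean(q_negatives))

adv = sampled_advantage(1.4, [0.2, 0.5, -0.1])
```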
arXiv Detail & Related papers (2021-11-05T12:51:15Z) - Causal Inference Q-Network: Toward Resilient Reinforcement Learning [57.96312207429202]
We consider a resilient DRL framework with observational interferences.
Under this framework, we propose a causal inference-based DRL algorithm called causal inference Q-network (CIQ).
Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.
arXiv Detail & Related papers (2021-02-18T23:50:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.