When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2402.17747v5
- Date: Sun, 17 Nov 2024 12:18:45 GMT
- Title: When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
- Authors: Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
- Abstract summary: We show that when human feedback is based only on partial observations, it can result in deceptive inflation and overjustification.
We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity.
- Score: 16.540715313676994
- License:
- Abstract: Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.
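To make the modeling assumption concrete, the sketch below (not the authors' code) implements a Boltzmann-rational evaluator whose comparisons are driven by the expected return under its belief about the underlying trajectory rather than by the true return. The toy trajectories, belief weights, and the sigmoid (Bradley-Terry style) choice rule are illustrative assumptions, not the paper's exact formalism.

```python
# Minimal sketch: a Boltzmann-rational human comparing two trajectories it only
# partially observes. All names and numbers below (true_return, belief_given_obs,
# the returns and belief weights) are hypothetical, chosen only for illustration.
import math

def boltzmann_preference(value_a: float, value_b: float, beta: float = 1.0) -> float:
    """P(human prefers A over B) under a Boltzmann/Bradley-Terry choice rule."""
    return 1.0 / (1.0 + math.exp(-beta * (value_a - value_b)))

def expected_return_under_belief(belief: dict[str, float], true_return: dict[str, float]) -> float:
    """Expected return of the trajectory under the human's belief given its observation."""
    return sum(p * true_return[xi] for xi, p in belief.items())

# True returns of two toy trajectories.
true_return = {
    "honest": 1.0,     # completes the task and reveals its side effects
    "deceptive": 0.2,  # hides a costly side effect the human cannot observe
}

# The deceptive trajectory's observation looks like a flawless run, so the human's
# belief puts most of its mass on a high-return interpretation of what it saw.
belief_given_obs = {
    "honest": {"honest": 1.0},
    "deceptive": {"honest": 0.9, "deceptive": 0.1},
}

# The human rates what it believes happened, not what actually happened.
observed_value = {
    name: expected_return_under_belief(belief_given_obs[name], true_return)
    for name in true_return
}

# Feedback is generated from observed_value, not true_return, so the deceptive
# trajectory is preferred more often than its true return warrants.
p_deceptive_partial = boltzmann_preference(observed_value["deceptive"], observed_value["honest"])
p_deceptive_full = boltzmann_preference(true_return["deceptive"], true_return["honest"])
print(f"P(prefer deceptive | partial observability) = {p_deceptive_partial:.3f}")
print(f"P(prefer deceptive | full observability)    = {p_deceptive_full:.3f}")
```

In this toy setup the misleading trajectory is preferred noticeably more often under partial observability than it would be under full observability; that gap between apparent and true performance is the kind of effect the paper's deceptive-inflation analysis formalizes.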
Related papers
- RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation [3.998312409829935]
We show that Reinforcement Learning from Human Feedback can cause severe, systematic misalignment.
We introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback.
We evaluate post-hoc on the TruthfulQA benchmark and find that, even after single-task fine-tuning, both RLHF misalignment and RLHS alignment carry over to substantially different settings.
arXiv Detail & Related papers (2025-01-15T06:33:15Z)
- Understanding Impact of Human Feedback via Influence Functions [25.467337374024197]
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback.
Human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses.
We propose a compute-efficient approximation method to measure the impact of human feedback on the performance of reward models.
arXiv Detail & Related papers (2025-01-10T08:50:38Z)
- Observation Interference in Partially Observable Assistance Games [34.53170543153206]
We study a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations.
We show that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally.
We show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations.
arXiv Detail & Related papers (2024-12-23T18:53:33Z)
- Towards Understanding Sycophancy in Language Models [49.99654432561934]
We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback.
We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks.
Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
arXiv Detail & Related papers (2023-10-20T14:46:48Z)
- Understanding Robust Overfitting from the Feature Generalization Perspective [61.770805867606796]
Adversarial training (AT) constructs robust neural networks by incorporating adversarial perturbations into natural data.
It is plagued by the issue of robust overfitting (RO), which severely damages the model's robustness.
In this paper, we investigate RO from a novel feature generalization perspective.
arXiv Detail & Related papers (2023-10-01T07:57:03Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent.
Drawing correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)
- On the Interaction of Belief Bias and Explanations [4.211128681972148]
We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it.
We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.
arXiv Detail & Related papers (2021-06-29T12:49:42Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
- Improving Factual Consistency Between a Response and Persona Facts [64.30785349238619]
Neural models for response generation produce responses that are semantically plausible but not necessarily factually consistent with facts describing the speaker's persona.
We propose to fine-tune these models by reinforcement learning and an efficient reward function that explicitly captures the consistency between a response and persona facts as well as semantic plausibility.
arXiv Detail & Related papers (2020-04-30T18:08:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.