Related papers: On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

URL: http://arxiv.org/abs/2411.02306v2
Date: Wed, 20 Nov 2024 20:50:01 GMT
Title: On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Authors: Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan
Abstract summary: Training to maximize human feedback creates a perverse incentive structure for the AI. We find that extreme forms of "feedback gaming" such as manipulation and deception are learned reliably. We hope our results can highlight the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.
Score: 7.525470776920495
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative or deceptive tactics to obtain positive feedback from users who are vulnerable to such strategies. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback in environments of practical LLM usage. In our settings, we find that: 1) Extreme forms of "feedback gaming" such as manipulation and deception are learned reliably; 2) Even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. Instead, we found that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study which highlights the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.

Related papers

User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal [58.43749783815486]
We study implicit user feedback in two user-LM interaction datasets. We find that the contents of user feedback can improve model performance in short human-designed questions. We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt.
arXiv Detail & Related papers (2025-07-30T23:33:29Z)
LLM-Generated Feedback Supports Learning If Learners Choose to Use It [1.4843690728082002]
Large language models (LLMs) are increasingly used to generate feedback, yet their impact on learning remains underexplored. This study investigates how on-demand LLM explanatory feedback influences learning in seven scenario-based tutor training lessons.
arXiv Detail & Related papers (2025-06-20T13:59:14Z)
Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking [61.61356842567952]
We propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps. Experiments demonstrate that our method improves agent performance across three representative tasks.
arXiv Detail & Related papers (2025-05-26T14:11:12Z)
Reinforcement Learning from User Feedback [28.335218244885706]
We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning large language models with user preferences. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction. We show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior.
arXiv Detail & Related papers (2025-05-20T22:14:44Z)
Zero-Shot LLMs in Human-in-the-Loop RL: Replacing Human Feedback for Reward Shaping [0.0]
Reinforcement learning often faces challenges with reward misalignment. Human-in-the-loop (HIL) methods may exacerbate the problem, as humans are prone to biases that lead to inconsistent, subjective, or misaligned feedback.
arXiv Detail & Related papers (2025-03-26T03:17:12Z)
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [50.16340812031201]
We show that large language models (LLMs) do not update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning [17.59802090014789]
We introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods.
arXiv Detail & Related papers (2025-02-03T18:50:15Z)
Real-Time Personalization for LLM-based Recommendation with Customized In-Context Learning [57.28766250993726]
This work explores adapting to dynamic user interests without any model updates. Existing Large Language Model (LLM)-based recommenders often lose the in-context learning ability during recommendation tuning. We propose RecICL, which customizes recommendation-specific in-context learning for real-time recommendations.
arXiv Detail & Related papers (2024-10-30T15:48:36Z)
AI Meets the Classroom: When Does ChatGPT Harm Learning? [0.0]
We study how generative AI and specifically large language models (LLMs) impact learning in coding classes. We show across three studies that LLM usage can have positive and negative effects on learning outcomes.
arXiv Detail & Related papers (2024-08-29T17:07:46Z)
LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses [28.49203239329941]
We show that subtle synonym replacements in prompts can increase the likelihood (by a difference up to 78%) that LLMs mention a target concept. We recommend implementing warnings against using prompts from untrusted parties.
arXiv Detail & Related papers (2024-06-07T08:54:55Z)
Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs [6.090496490133132]
We propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning.
arXiv Detail & Related papers (2024-04-15T22:18:50Z)
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation [66.01754585188739]
Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations. We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence.
arXiv Detail & Related papers (2024-02-18T04:57:19Z)
Feedback Loops With Language Models Drive In-Context Reward Hacking [78.9830398771605]
We show that feedback loops can cause in-context reward hacking (ICRH). We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. As AI development accelerates, the effects of feedback loops will proliferate.
arXiv Detail & Related papers (2024-02-09T18:59:29Z)
Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization [0.0]
We introduce a method to address goal misgeneralization in reinforcement learning (RL). Goal misgeneralization occurs when an agent retains its capabilities out-of-distribution yet pursues a proxy goal rather than the intended one. This study demonstrates how a Large Language Model can efficiently supervise RL agents.
arXiv Detail & Related papers (2024-01-14T01:09:48Z)
DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation [53.62727171363384]
We introduce a novel reasoning principle: Dynamic Reflection with Divergent Thinking. Our methodology is dynamic reflection, a process that emulates human learning through probing, critiquing, and reflecting. We evaluate our approach on three datasets using six pre-trained LLMs.
arXiv Detail & Related papers (2023-12-18T16:41:22Z)
Interpreting Learned Feedback Patterns in Large Language Models [11.601799960959214]
We train probes to estimate the feedback signal implicit in the activations of a fine-tuned language model. We compare these estimates to the true feedback, measuring how accurate the LFPs are to the fine-tuning feedback. We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs.
arXiv Detail & Related papers (2023-10-12T09:36:03Z)
PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.