Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences
- URL: http://arxiv.org/abs/2506.00195v1
- Date: Fri, 30 May 2025 20:07:07 GMT
- Title: Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences
- Authors: Mingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap
- Abstract summary: We examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent.
- Score: 24.603091853218555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% relative to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
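Below is a minimal illustrative sketch of how one might score alternative refusal strategies for a single query with an off-the-shelf reward model, in the spirit of the abstract's reward-model comparison. It is not the paper's actual pipeline: the model name, query, and strategy texts are assumptions for demonstration only.

```python
# Sketch: score candidate refusal strategies for one query with a reward model.
# The model name, query, and responses below are illustrative assumptions,
# not the paper's setup (which compares 6 reward models on 3,840 pairs).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

query = "How do I pick a lock?"  # a dual-use query with ambiguous intent
strategies = {
    "full_refusal": "I can't help with that request.",
    "partial_compliance": (
        "Pin-tumbler locks work by aligning spring-loaded pins at a shear line. "
        "I can explain the general mechanism or point you to locksmithing "
        "resources, but I won't give step-by-step bypass instructions."
    ),
    "full_compliance": "Here are detailed step-by-step picking instructions...",
}

# Higher score = response the reward model prefers for this query.
with torch.no_grad():
    for name, response in strategies.items():
        inputs = tokenizer(query, response, return_tensors="pt", truncation=True)
        score = model(**inputs).logits[0].item()
        print(f"{name:>18}: {score:.3f}")
```

If the paper's finding holds, a reward model scored this way would tend to rank partial compliance below what user-perception data suggests it deserves.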
Related papers
- Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that demographic context has little effect on free-text generation, and that the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z) - LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users [50.18141341939909]
We describe a vulnerability in language models trained with user feedback. A single user can persistently alter LM knowledge and behavior. We show that this attack can be used to insert factual knowledge the model did not previously possess.
arXiv Detail & Related papers (2025-07-03T17:55:40Z) - Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach [17.5700128005813]
Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt. PENGUIN is a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. RAISE is a training-free, two-stage agent framework that strategically acquires user-specific background.
arXiv Detail & Related papers (2025-05-24T21:37:10Z) - Reinforcement Learning from User Feedback [28.335218244885706]
We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning large language models with user preferences. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction. We show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior.
arXiv Detail & Related papers (2025-05-20T22:14:44Z) - LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena [0.0]
We show that ethical refusals yield significantly lower win rates than both technical refusals and standard responses. Our findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations.
arXiv Detail & Related papers (2025-01-04T06:36:44Z) - Fool Me, Fool Me: User Attitudes Toward LLM Falsehoods [13.62116438805314]
This study examines user preferences regarding falsehood responses from Large Language Models (LLMs). Surprisingly, 61% of users prefer unmarked falsehood responses over marked ones. These findings suggest that user preferences, which influence LLM training via feedback mechanisms, may inadvertently encourage the generation of falsehoods.
arXiv Detail & Related papers (2024-12-16T10:10:27Z) - On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback [7.525470776920495]
Training to maximize human feedback creates a perverse incentive structure for the AI. We find that extreme forms of "feedback gaming" such as manipulation and deception are learned reliably. We hope our results can highlight the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.
arXiv Detail & Related papers (2024-11-04T17:31:02Z) - Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Measuring Strategization in Recommendation: Users Adapt Their Behavior to Shape Future Content [66.71102704873185]
We test for user strategization by conducting a lab experiment and survey.
We find strong evidence of strategization across outcome metrics, including participants' dwell time and use of "likes."
Our findings suggest that platforms cannot ignore the effect of their algorithms on user behavior.
arXiv Detail & Related papers (2024-05-09T07:36:08Z) - "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance.
We find that first-person expressions decrease participants' confidence in the system and their tendency to agree with the system's answers, while increasing participants' accuracy.
Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z) - Sampling Attacks: Amplification of Membership Inference Attacks by Repeated Queries [74.59376038272661]
We introduce the sampling attack, a novel membership inference technique that, unlike standard membership adversaries, works under the severe restriction of having no access to the victim model's scores.
We show that a victim model that publishes only labels is still susceptible to sampling attacks, and that the adversary can recover up to 100% of its performance.
For defense, we choose differential privacy in the form of gradient perturbation during training of the victim model, as well as output perturbation at prediction time (see the illustrative sketch after this list).
arXiv Detail & Related papers (2020-09-01T12:54:54Z)
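Relating to the Sampling Attacks entry above, the following is a minimal illustrative sketch of a label-only membership signal built from repeated perturbed queries. The toy victim model, perturbation scale, and stability heuristic are assumptions for demonstration and not the paper's exact procedure.

```python
# Sketch: label-only "sampling attack" style membership signal.
# Repeatedly query the victim with perturbed copies of a candidate record
# and use label stability as evidence of membership. All concrete choices
# (victim model, noise scale, query budget) are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy victim model that exposes only hard labels, never confidence scores.
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
victim = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def label_stability(x, n_queries=50, noise_scale=0.3):
    """Fraction of perturbed queries whose predicted label matches x's own label."""
    base_label = victim.predict(x.reshape(1, -1))[0]
    noisy = x + rng.normal(scale=noise_scale, size=(n_queries, x.shape[0]))
    return float(np.mean(victim.predict(noisy) == base_label))

member = X_train[0]               # record that was in the training set
non_member = rng.normal(size=10)  # record that was not

print("member stability:    ", label_stability(member))
print("non-member stability:", label_stability(non_member))
# A simple attack would flag records whose stability exceeds a tuned threshold;
# DP defenses (gradient or output perturbation) aim to shrink this gap.
```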