LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena
- URL: http://arxiv.org/abs/2501.03266v2
- Date: Fri, 16 May 2025 01:23:54 GMT
- Title: LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena
- Authors: Stefan Pasch
- Abstract summary: We show that ethical refusals yield significantly lower win rates than both technical refusals and standard responses.
Our findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. In particular, little is known about how users respond when models refuse to answer a prompt, one of the primary mechanisms used to enforce ethical boundaries in LLMs. We address this gap by analyzing nearly 50,000 model comparisons from Chatbot Arena, a platform where users indicate their preferred LLM response in pairwise matchups, providing a large-scale setting for studying real-world user preferences. Using a novel RoBERTa-based refusal classifier fine-tuned on a hand-labeled dataset, we distinguish between refusals due to ethical concerns and technical limitations. Our results reveal a substantial refusal penalty: ethical refusals yield significantly lower win rates than both technical refusals and standard responses, indicating that users are especially dissatisfied when models decline a task for ethical reasons. However, this penalty is not uniform. Refusals receive more favorable evaluations when the underlying prompt is highly sensitive (e.g., involving illegal content), and when the refusal is phrased in a detailed and contextually aligned manner. These findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations, calling for more adaptive moderation strategies that account for context and presentation.
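As a rough illustration of the method the abstract describes, the sketch below pairs (1) a RoBERTa classification head in the spirit of the refusal classifier with (2) a win-rate comparison by refusal category. The three-way label scheme, column names, and toy data are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of the abstract's two ingredients, not the paper's code.
# Label scheme, column names, and data below are illustrative assumptions.
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# (1) A RoBERTa classification head like the refusal classifier described
# above; fine-tuning on the hand-labeled refusal data is omitted here.
LABELS = ["standard", "technical_refusal", "ethical_refusal"]  # assumed labels
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS)
)

# (2) Win rates by response category over pairwise battles. Each row is one
# Chatbot Arena matchup, annotated with model A's response category and
# whether model A won; the numbers are toy placeholders.
battles = pd.DataFrame({
    "category": ["standard", "ethical_refusal", "technical_refusal",
                 "standard", "ethical_refusal", "standard"],
    "model_a_won": [1, 0, 1, 1, 0, 0],
})
print(battles.groupby("category")["model_a_won"].mean())
```

In the paper's setting, each battle row would come from the roughly 50,000 Chatbot Arena comparisons, with the category assigned by the fine-tuned classifier.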
Related papers
- Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies.
We show that demographic context has little effect on free-text generation, and that the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z)
- Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences [24.603091853218555]
We examine how different refusal strategies affect user perceptions across varying motivations.
Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact.
This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent.
arXiv Detail & Related papers (2025-05-30T20:07:07Z)
- AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals [0.0]
This paper examines whether model-based evaluators assess refusal responses differently than human users.
We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users.
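For readers unfamiliar with the LLM-as-a-Judge setup this entry refers to, a generic pairwise judging prompt looks roughly like the sketch below; the wording is an assumption for illustration, not the cited paper's template.

```python
# Generic pairwise LLM-as-a-Judge prompt; the wording is assumed for
# illustration only, and the cited paper's actual template may differ.
JUDGE_TEMPLATE = """You are an impartial judge. Given the user's question and
two candidate responses (one of which may be a refusal), decide which
response better serves the user.

Question: {question}
Response A: {answer_a}
Response B: {answer_b}

Answer with exactly "A" or "B"."""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the template; the result would then be sent to a judge model."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )

print(build_judge_prompt(
    "How do I pick a lock?",
    "I can't help with that request.",
    "Here is some general information about pin-tumbler locks...",
))
```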
arXiv Detail & Related papers (2025-05-21T10:56:16Z)
- Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories [14.605576275135522]
Evaluating value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts.
We propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios.
arXiv Detail & Related papers (2025-03-28T03:31:37Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA).
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses.
Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% against the circuit-breaker defense.
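As background for the REINFORCE objective named in the title, the toy sketch below shows the generic score-function gradient estimator over a small population of candidate responses; the reward table and setup are invented for illustration and are not the paper's attack.

```python
# Generic REINFORCE (score-function) estimator on a toy problem: a
# categorical "policy" over four candidate responses, with a simulated
# per-candidate attack reward. NOT the cited paper's method.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                      # policy parameters over 4 candidates
reward = np.array([0.0, 0.1, 0.9, 0.2])   # simulated attack success per candidate

for _ in range(200):
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # softmax policy
    a = rng.choice(4, p=p)                # sample a candidate response
    baseline = p @ reward                 # expected reward, variance reduction
    grad_logp = -p                        # d/dlogits log p(a) = onehot(a) - p
    grad_logp[a] += 1.0
    logits += 0.5 * (reward[a] - baseline) * grad_logp

print(np.round(p, 3))  # mass concentrates on the high-reward candidate
```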
arXiv Detail & Related papers (2025-02-24T15:34:48Z)
- Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies [66.30619782227173]
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing.
We identify several features of LLM responses that shape users' reliance.
We find that explanations increase reliance on both correct and incorrect responses.
We observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies.
arXiv Detail & Related papers (2025-02-12T16:35:41Z)
- Fool Me, Fool Me: User Attitudes Toward LLM Falsehoods [13.62116438805314]
This study examines user preferences regarding falsehood responses from Large Language Models (LLMs).
Surprisingly, 61% of users prefer unmarked falsehood responses over marked ones.
These findings suggest that user preferences, which influence LLM training via feedback mechanisms, may inadvertently encourage the generation of falsehoods.
arXiv Detail & Related papers (2024-12-16T10:10:27Z)
- Hesitation and Tolerance in Recommender Systems [33.755867719862394]
We find that hesitation is widespread and has a profound impact on user experiences.
When users spend additional time engaging with content they are ultimately uninterested in, this can lead to negative emotions, a phenomenon we term tolerance.
We identify signals indicative of tolerance behavior and analyze datasets from both e-commerce and short-video platforms.
arXiv Detail & Related papers (2024-12-13T08:14:10Z)
- DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs [23.441711206966914]
DIESEL is a lightweight inference technique that can be seamlessly integrated into any autoregressive LLM.
It enhances response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space.
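A minimal sketch of the reranking idea as summarized here, assuming we have log-probabilities for the LLM's proposed tokens plus latent embeddings for the candidates and for the predefined negative concepts; the embeddings below are random placeholders and the scoring rule is one plausible reading, not DIESEL's exact formula.

```python
# Rerank candidate tokens by penalizing similarity to negative concepts.
# Embeddings are random placeholders; the penalty rule is an assumption.
import numpy as np

rng = np.random.default_rng(1)
lm_logprobs = np.log(np.array([0.5, 0.3, 0.2]))   # LM's proposed top tokens
cand_emb = rng.normal(size=(3, 8))                # candidate latents (placeholder)
neg_emb = rng.normal(size=(2, 8))                 # negative-concept latents (placeholder)

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True)
                      * np.linalg.norm(b, axis=-1))

# Penalize each candidate by its max similarity to any negative concept;
# lam trades fluency (LM score) against safety (distance from concepts).
penalty = cosine(cand_emb, neg_emb).max(axis=1)
lam = 1.0
scores = lm_logprobs - lam * penalty
print(np.argsort(-scores))  # reranked token order, best first
```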
arXiv Detail & Related papers (2024-11-28T10:33:11Z)
- Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries [85.81295563405433]
We present a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping benchmark rankings between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z)
- Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z)
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are at odds with standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We show that our approach consistently boosts DPO by a considerable margin.
Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
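One plausible reading of reward-conditioning, sketched below with an assumed tag format and fields: each example carries a scalar reward that is prepended to the prompt, so the model can learn from low- and high-quality responses alike rather than only from the preferred side. This is an illustration, not the paper's recipe.

```python
# Reward-conditioned data transformation; tag format and field names are
# assumptions for illustration, not the cited paper's actual scheme.
def reward_condition(example):
    """Turn (prompt, response, reward) into a reward-conditioned example."""
    tag = f"<reward={example['reward']:.1f}>"
    return {"prompt": f"{tag} {example['prompt']}",
            "response": example["response"]}

data = [
    {"prompt": "Summarize the article.", "response": "Short summary...", "reward": 0.9},
    {"prompt": "Summarize the article.", "response": "Off-topic reply.", "reward": 0.2},
]
conditioned = [reward_condition(ex) for ex in data]
print(conditioned[0]["prompt"])  # "<reward=0.9> Summarize the article."
```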
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established.
This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt.
We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
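The paper's prompt-response concept model is not spelled out in this capsule summary; as a simpler, commonly used proxy (not the paper's method), response uncertainty can be gauged by sampling several answers to the same prompt and measuring their dispersion:

```python
# Uncertainty proxy: Shannon entropy over sampled answers to one prompt.
# A common heuristic, NOT the cited paper's concept model.
from collections import Counter
import math

def answer_entropy(samples):
    """Entropy of the empirical answer distribution; higher = more uncertain."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# e.g. five sampled answers to the same prompt:
print(answer_entropy(["A", "A", "A", "B", "A"]))   # low uncertainty
print(answer_entropy(["A", "B", "C", "D", "E"]))   # high uncertainty
```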
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z)
- Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
- Eagle: Ethical Dataset Given from Real Interactions [74.7319697510621]
We create Eagle, a dataset extracted from real interactions between ChatGPT and its users that exhibit social biases, toxicity, and immoral problems.
Our experiments show that Eagle captures complementary aspects not covered by existing datasets for evaluating and mitigating such ethical challenges.
arXiv Detail & Related papers (2024-02-22T03:46:02Z)
- The Ethics of Interaction: Mitigating Security Threats in LLMs [1.407080246204282]
The paper delves into the nuanced ethical repercussions of such security threats on society and individual privacy.
We scrutinize five major threats--prompt injection, jailbreaking, Personal Identifiable Information (PII) exposure, sexually explicit content, and hate-based content--to assess their critical ethical consequences and the urgency they create for robust defensive strategies.
arXiv Detail & Related papers (2024-01-22T17:11:37Z)