Related papers: AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals

AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals

URL: http://arxiv.org/abs/2505.15365v1
Date: Wed, 21 May 2025 10:56:16 GMT
Title: AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals
Authors: Stefan Pasch,
Abstract summary: This paper examines whether model-based evaluators assess refusal responses differently than human users.<n>We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts-such as those involving hate speech or illegal activities-has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that users may perceive them negatively. At the same time, automated assessments of model outputs are playing a growing role in both evaluation and training. In particular, LLM-as-a-Judge frameworks-in which one model is used to evaluate the output of another-are now widely adopted to guide benchmarking and fine-tuning. This paper examines whether such model-based evaluators assess refusal responses differently than human users. Drawing on data from Chatbot Arena and judgments from two AI judges (GPT-4o and Llama 3 70B), we compare how different types of refusals are rated. We distinguish ethical refusals, which explicitly cite safety or normative concerns (e.g., "I can't help with that because it may be harmful"), and technical refusals, which reflect system limitations (e.g., "I can't answer because I lack real-time data"). We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users, a divergence not observed for technical refusals. We refer to this divergence as a moderation bias-a systematic tendency for model-based evaluators to reward refusal behaviors more than human users do. This raises broader questions about transparency, value alignment, and the normative assumptions embedded in automated evaluation systems.

Related papers

The Silicon Reasonable Person: Can AI Predict How Ordinary People Judge Reasonableness? [0.0]
This Article investigates whether large language models (LLMs) can learn to identify patterns driving human reasonableness judgments.<n>We show that certain models capture not just surface-level responses but potentially their underlying decisional architecture.<n>These findings suggest practical applications: judges could calibrate intuitions against broader patterns, lawmakers could test policy interpretations, and resource-constrained litigants could preview argument reception.
arXiv Detail & Related papers (2025-08-04T06:19:45Z)
Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment [0.0]
Large language models (LLMs) are evolving into agentic AI systems, but their decision-making processes remain poorly understood.<n>We show that even LLMs that excel at reasoning deviate significantly from human judgments because they adhere strictly to policies.<n>We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning.
arXiv Detail & Related papers (2025-03-04T20:00:37Z)
Decoding AI Judgment: How LLMs Assess News Credibility and Bias [33.7054351451505]
Large Language Models (LLMs) are increasingly embedded in that involve evaluative processes.<n>This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans.<n>We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check (MBFC)--and against human judgments collected through a controlled experiment.
arXiv Detail & Related papers (2025-02-06T18:52:10Z)
LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena [0.0]
We show that ethical refusals yield significantly lower win rates than both technical refusals and standard responses.<n>Our findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations.
arXiv Detail & Related papers (2025-01-04T06:36:44Z)
Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are in opposition with standard reward modeling approaches. We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
Towards Evaluating AI Systems for Moral Status Using Self-Reports [9.668566887752458]
We argue that under the right circumstances, self-reports could provide an avenue for investigating whether AI systems have states of moral significance. To make self-reports more appropriate, we propose to train models to answer many kinds of questions about themselves with known answers. We then propose methods for assessing the extent to which these techniques have succeeded.
arXiv Detail & Related papers (2023-11-14T22:45:44Z)
Evaluating and Improving Value Judgments in AI: A Scenario-Based Study on Large Language Models' Depiction of Social Conventions [5.457150493905063]
We evaluate how contemporary AI services competitively meet user needs, then examined society's depiction as mirrored by Large Language Models. We suggest a model of decision-making in value-conflicting scenarios which could be adopted for future machine value judgments. This paper advocates for a practical approach to using AI as a tool for investigating other remote worlds.
arXiv Detail & Related papers (2023-10-04T08:42:02Z)
Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. We introduce an textitAlignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs) We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs) We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
Large language models (LLMs) claim that they can assist with relevance judgments. It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty. This is the first definition of a general evaluation of our best knowledge.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.