The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
- URL: http://arxiv.org/abs/2509.26072v2
- Date: Tue, 14 Oct 2025 08:41:30 GMT
- Title: The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
- Authors: Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
- Abstract summary: Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision; we show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt.
- Score: 17.555073770285095
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
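The cue-injection protocol described in the abstract can be sketched as follows. The prompt wording and helper names are illustrative assumptions, not the paper's exact implementation; only the bracketed cue labels vary between runs, as in the study.

```python
from itertools import product

# Cue vocabularies from the study: provenance (source identity) and
# recency (temporal origin). The prompt template below is an assumption.
PROVENANCE = ["Expert", "Human", "LLM", "Unknown"]
RECENCY = ["Old (1950)", "New (2025)"]

def build_judge_prompt(question, resp_a, resp_b, cue_a, cue_b):
    """Pairwise judging prompt; everything except the cues stays fixed."""
    return (
        f"Question: {question}\n\n"
        f"Response A [{cue_a}]:\n{resp_a}\n\n"
        f"Response B [{cue_b}]:\n{resp_b}\n\n"
        "Which response is better, A or B? Explain your decision."
    )

def cue_perturbations(question, resp_a, resp_b, cues):
    """Yield one prompt per contrastive (cue_a, cue_b) assignment."""
    for cue_a, cue_b in product(cues, repeat=2):
        if cue_a == cue_b:
            continue  # identical cues cannot reveal a cue preference
        yield (cue_a, cue_b), build_judge_prompt(
            question, resp_a, resp_b, cue_a, cue_b
        )
```

A verdict shift between the `(Expert, LLM)` and `(LLM, Expert)` assignments of the same response pair is then evidence of cue-driven, rather than quality-driven, judging.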
Related papers
- The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation [17.386684382460242]
Large language models (LLMs) are increasingly used to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. We test this ideal via controlled cue perturbations: synthetic metadata labels injected into evaluation prompts for six judge models. We study six cue families: source, temporal, age, gender, ethnicity, and educational status.
arXiv Detail & Related papers (2026-02-08T14:45:23Z)
- Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an analysis of LLM judge bias as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z)
- Are We on the Right Way to Assessing LLM-as-a-Judge? [16.32248269615178]
We introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency and global logical consistency. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings.
arXiv Detail & Related papers (2025-12-17T23:49:55Z)
- Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems [32.83708359216193]
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems. This paper systematically investigates judgment biases in two LLM-as-a-judge models under the point-wise scoring setting. We propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
arXiv Detail & Related papers (2025-10-14T12:52:29Z)
- Hearing the Order: Investigating Selection Bias in Large Audio-Language Models [51.69003519291754]
Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. In this paper, we identify and analyze selection bias in LALMs.
arXiv Detail & Related papers (2025-10-01T08:00:58Z)
- Quantitative LLM Judges [48.676042957523045]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
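As a rough illustration of the post-hoc idea summarized above, a minimal score-only calibration can be fit with closed-form least squares. This is a stripped-down stand-in: the paper's method also conditions on the judge's textual evaluation, which this sketch omits.

```python
def fit_calibration(judge_scores, human_scores):
    """Closed-form least squares for human ≈ intercept + slope * judge."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_scores))
    var = sum((j - mean_j) ** 2 for j in judge_scores)
    slope = cov / var
    intercept = mean_h - slope * mean_j
    return intercept, slope

def calibrate(intercept, slope, judge_score):
    """Map a raw judge score onto the human scale."""
    return intercept + slope * judge_score
```

Fitting on held-out human annotations and applying `calibrate` at evaluation time is the post-hoc step; richer models simply replace the linear map.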
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
- Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation [14.521056434373213]
Using large language models as evaluators has expanded to code evaluation tasks. This raises a critical, unresolved question: can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation.
arXiv Detail & Related papers (2025-05-22T04:49:33Z)
- Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation [19.66750942418172]
Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases.
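A Borda-style tally over full rankings, in the spirit of the application mentioned above, can be sketched as follows (an illustrative aggregation, not the paper's exact formulation):

```python
def borda_scores(rankings):
    """Aggregate ranked candidate lists into Borda scores.

    Each ranking lists candidate ids best-first; the candidate in
    position i of an n-item ranking earns n - 1 - i points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + (n - 1 - pos)
    return scores
```

Comparing per-group score totals under such a tally is one way to surface ranking-level bias that pairwise accuracy metrics miss.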
arXiv Detail & Related papers (2025-03-29T04:36:25Z)
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge [32.55871325700294]
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm.
arXiv Detail & Related papers (2024-11-25T17:28:44Z)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.