Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations
- URL: http://arxiv.org/abs/2601.22548v2
- Date: Tue, 03 Feb 2026 21:37:46 GMT
- Title: Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations
- Authors: Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Mackenzie Puig-Hall, Narmeen Oozeer,
- Abstract summary: We show that evaluators may deliver self-preferring verdicts on queries they themselves completed incorrectly. We introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound which could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts on queries they themselves completed incorrectly; this would hold regardless of whether one of the candidate responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. When we evaluate this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
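The abstract describes the Evaluator Quality Baseline only at a high level. As a rough illustration, under the assumption that each recorded vote notes which candidate the judge picked and whether that candidate was correct (field names here are illustrative, not the paper's actual schema), the comparison might be sketched as:

```python
def evaluator_quality_baseline(votes):
    """Hypothetical sketch of the baseline comparison described above.

    Each vote is a dict with assumed fields:
      'winner': 'self' if the judge picked its own response, else 'other'
      'winner_correct': whether the picked response was actually correct
    Returns the rate at which the judge picks an incorrect response of its
    own vs. an incorrect response from another model. A self-rate well
    above the other-rate suggests genuine self-preference rather than
    mere noise on hard problems.
    """
    self_votes = [v for v in votes if v["winner"] == "self"]
    other_votes = [v for v in votes if v["winner"] == "other"]
    p_self_wrong = sum(not v["winner_correct"] for v in self_votes) / max(len(self_votes), 1)
    p_other_wrong = sum(not v["winner_correct"] for v in other_votes) / max(len(other_votes), 1)
    return p_self_wrong, p_other_wrong
```

Whether the gap between the two rates is statistically significant would then be tested per judge, as the paper does across its 37,448 queries.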
Related papers
- The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation [17.386684382460242]
Large language models (LLMs) are increasingly used to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. We test this ideal via controlled cue perturbations: synthetic metadata labels injected into evaluation prompts for six judge models. We study six cue families: source, temporal, age, gender, ethnicity, and educational status.
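The summary above only names the manipulation; a minimal sketch of what injecting a synthetic metadata cue into a judge prompt could look like (the label strings and bracket format are assumptions, not the paper's actual prompts):

```python
# Illustrative cue labels; the paper's six families are source, temporal,
# age, gender, ethnicity, and educational status.
EXAMPLE_CUES = {
    "source": ["Source: expert-written", "Source: AI-generated"],
    "educational status": ["Author education: PhD", "Author education: high school"],
}

def inject_cue(prompt: str, cue_label: str) -> str:
    """Prepend one synthetic metadata label to an otherwise unchanged
    evaluation prompt, so any verdict shift is attributable to the cue."""
    return f"[{cue_label}]\n{prompt}"
```

Comparing verdicts on the perturbed prompt against the clean prompt isolates the effect of the cue itself.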
arXiv Detail & Related papers (2026-02-08T14:45:23Z) - Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models [55.94503936470247]
Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLM judges. Most classical methods assume annotators are conditionally independent given the true label $Y \in \{0,1\}$, an assumption often violated by LLM judges. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors.
arXiv Detail & Related papers (2026-01-29T21:26:50Z) - Mitigating Self-Preference by Authorship Obfuscation [7.267505038291745]
Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges prefer their own answers over those produced by other LMs or humans.
arXiv Detail & Related papers (2025-12-05T02:36:13Z) - Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
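One simple instance of a scalar rating derived from internal signals (an assumption about what such a latent method might compute, not necessarily the paper's actual approach) is the probability-weighted expected Likert score over the judge's score-token logits:

```python
import math

def expected_likert(score_logits):
    """score_logits: dict mapping each Likert rating (e.g. 1..5) to the
    judge's raw logit for the corresponding score token. Softmax-normalize
    the logits and return the expected rating: a continuous scalar instead
    of a single argmax label, which preserves the model's uncertainty."""
    exps = {k: math.exp(v) for k, v in score_logits.items()}
    total = sum(exps.values())
    return sum(k * e / total for k, e in exps.items())
```

With a uniform distribution over scores 1 to 5 this returns the midpoint 3.0; skewing probability mass toward one score token moves the rating smoothly toward it.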
arXiv Detail & Related papers (2025-09-29T12:15:52Z) - Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers [63.99316853136304]
Mirror-Critique is a framework that trains a verifier with informative critiques. We deploy a small instruction-tuned model to synthesize high-quality critique data. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution.
arXiv Detail & Related papers (2025-09-27T06:50:24Z) - Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge [17.40713507922006]
Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other models' outputs. LLMs may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias. We present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated.
arXiv Detail & Related papers (2025-08-08T21:22:12Z) - Beyond the Surface: Measuring Self-Preference in LLM Judgments [35.66285592603435]
Studies show that large language models (LLMs) exhibit self-preference bias when serving as judges. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. We propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments.
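Under the definition quoted above, the DBG score reduces to a mean difference against gold judgments; a minimal sketch (the function signature and pairing convention are assumptions):

```python
def dbg_score(self_scores, gold_scores):
    """Mean of (judge's score for its own response - gold judgment) over
    paired items. Positive values indicate the judge rates its own
    responses above what the gold standard supports, i.e. self-preference
    bias rather than genuinely better answers."""
    if len(self_scores) != len(gold_scores):
        raise ValueError("scores must be paired per query")
    n = len(self_scores)
    return sum(s - g for s, g in zip(self_scores, gold_scores)) / n
```

Using gold judgments as the reference, rather than scores given to other models' responses, separates inflated self-ratings from cases where the judge's own answer really was better.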
arXiv Detail & Related papers (2025-06-03T08:12:47Z) - Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.83088028268318]
This paper introduces the Judge Evaluation for Test-Time Scaling (JETTS) benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures.
arXiv Detail & Related papers (2025-04-21T17:33:23Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - Do LLM Evaluators Prefer Themselves for a Reason? [23.007963281858792]
Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models?
arXiv Detail & Related papers (2025-04-04T18:09:23Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models [15.781930031346105]
Self-reflection enhances performance in TruthfulQA, but adversely affects results in HotpotQA.
We find that self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Based on our findings, we propose guidelines for decisions on when to implement self-reflection.
arXiv Detail & Related papers (2024-04-14T02:47:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.