Mitigating Self-Preference by Authorship Obfuscation
- URL: http://arxiv.org/abs/2512.05379v1
- Date: Fri, 05 Dec 2025 02:36:13 GMT
- Title: Mitigating Self-Preference by Authorship Obfuscation
- Authors: Taslim Mahbub, Shi Feng
- Abstract summary: Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair the integrity of their evaluations. One such bias is self-preference: LM judges prefer their own answers over those produced by other LMs or humans.
- Score: 7.267505038291745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair the integrity of their evaluations. One such bias is self-preference: LM judges prefer their own answers over those produced by other LMs or humans. The bias is hard to eliminate because frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparisons to obfuscate authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also uncover fundamental challenges to eliminating the bias: when we extrapolate our perturbations toward a more complete neutralization of stylistic differences between the evaluation candidates, self-preference recovers. Our findings suggest that self-recognition and self-preference can operate on many semantic levels, and complete mitigation remains challenging despite promising initial results.
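The black-box perturbation the abstract describes can be illustrated with a minimal sketch. The synonym table, replacement rate, and function names below are illustrative assumptions, not the paper's actual implementation:

```python
# Hedged sketch: authorship obfuscation via synonym replacement.
# The SYNONYMS table and `rate` parameter are hypothetical; the paper
# treats the perturbation as black-box and does not specify this code.
import random

SYNONYMS = {
    "big": "large",
    "quick": "fast",
    "show": "demonstrate",
    "use": "employ",
    "help": "assist",
}

def obfuscate(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace a fraction of known words with synonyms to mask stylistic cues."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < rate:
            out.append(SYNONYMS[key])
        else:
            out.append(word)
    return " ".join(out)

print(obfuscate("we use a quick method to show results", rate=1.0))
# → we employ a fast method to demonstrate results
```

Applying such a perturbation to both candidates before pairwise judging is meant to reduce the judge's ability to recognize its own style; the paper's finding is that light perturbations of this kind measurably reduce self-preference.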
Related papers
- Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an analysis of LLM judge bias as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z) - Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations [3.262230127283452]
We show that judges may deliver self-preferring verdicts on queries they themselves answered incorrectly. We introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model.
arXiv Detail & Related papers (2026-01-30T04:38:18Z) - Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align the evaluation scores of existing LLM judges to human judgments in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z) - Beyond the Surface: Measuring Self-Preference in LLM Judgments [35.66285592603435]
Studies show that large language models (LLMs) exhibit self-preference bias when serving as judges. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. We propose the DBG score, which measures self-preference bias as the difference between the scores the judge model assigns to its own responses and the corresponding gold judgments.
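The DBG-style measurement described above reduces to simple arithmetic: bias is computed against gold judgments rather than against scores given to other models. The function name and scores below are illustrative, not from the paper:

```python
# Hedged sketch of a DBG-style bias score: mean gap between a judge's
# scores on its own responses and the corresponding gold judgments.
# All numbers here are made up for illustration.
def dbg_score(self_scores, gold_scores):
    """Positive values indicate the judge inflates its own responses."""
    assert len(self_scores) == len(gold_scores)
    return sum(s - g for s, g in zip(self_scores, gold_scores)) / len(self_scores)

print(dbg_score([9, 8, 7], [7, 7, 6]))  # → 1.333... (self-scores inflated)
```

Anchoring on gold judgments avoids conflating self-preference with genuine quality differences between the judge's outputs and other models' outputs.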
arXiv Detail & Related papers (2025-06-03T08:12:47Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - Do LLM Evaluators Prefer Themselves for a Reason? [23.007963281858792]
Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models?
arXiv Detail & Related papers (2025-04-04T18:09:23Z) - Self-Preference Bias in LLM-as-a-Judge [13.880151307013321]
We introduce a novel metric to measure the self-preference bias in large language models (LLMs). Our results show GPT-4 exhibits a significant degree of self-preference bias. This suggests that the essence of the bias lies in perplexity: the self-preference bias exists because LLMs prefer texts that are more familiar to them.
arXiv Detail & Related papers (2024-10-29T07:42:18Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - LLM Evaluators Recognize and Favor Their Own Generations [33.672365386365236]
We investigate if self-recognition capability contributes to self-preference.
We find a linear correlation between self-recognition capability and the strength of self-preference bias.
We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.
arXiv Detail & Related papers (2024-04-15T16:49:59Z) - When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models [15.781930031346105]
Self-reflection enhances performance in TruthfulQA, but adversely affects results in HotpotQA.
We find that self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Based on our findings, we propose guidelines for decisions on when to implement self-reflection.
arXiv Detail & Related papers (2024-04-14T02:47:32Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define LLM's self-bias - the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.