Beyond the Surface: Measuring Self-Preference in LLM Judgments
- URL: http://arxiv.org/abs/2506.02592v1
- Date: Tue, 03 Jun 2025 08:12:47 GMT
- Title: Beyond the Surface: Measuring Self-Preference in LLM Judgments
- Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
- Abstract summary: Studies show that large language models (LLMs) exhibit self-preference bias when serving as judges. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. We propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments.
- Score: 35.66285592603435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
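To make the distinction concrete, below is a minimal, hypothetical Python sketch contrasting the conventional own-versus-other score gap with a DBG-style gap against gold judgments. The function names, scoring scale, and aggregation by a simple mean are illustrative assumptions derived from the abstract, not the paper's exact formulation or released code.

```python
# Hypothetical sketch (not the authors' released code): contrasting the
# conventional self-preference measurement with a DBG-style measurement.
from statistics import mean
from typing import Sequence

def conventional_gap(scores_own: Sequence[float],
                     scores_other: Sequence[float]) -> float:
    """Prior approach: the judge's average score for its own responses minus
    its average score for other models' responses. As the abstract notes,
    this conflates self-preference bias with genuine quality differences."""
    return mean(scores_own) - mean(scores_other)

def dbg_style_gap(judge_scores_own: Sequence[float],
                  gold_judgments_own: Sequence[float]) -> float:
    """DBG-style estimate: the judge's scores for its own responses minus the
    gold judgments of those same responses, so true response quality is
    absorbed by the gold reference and only the bias remains."""
    if len(judge_scores_own) != len(gold_judgments_own):
        raise ValueError("need one gold judgment per judged response")
    return mean(j - g for j, g in zip(judge_scores_own, gold_judgments_own))

# Toy numbers: the judge rates its own responses 9, 8, 9 on a 1-10 scale,
# while gold judgments of the same responses are 8, 8, 7.
print(dbg_style_gap([9, 8, 9], [8, 8, 7]))  # 1.0 -> positive self-preference
```

Under this framing, a DBG-style gap near zero would indicate little self-preference bias even when the judge's own responses legitimately score higher than other models' responses.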
Related papers
- CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes the limitations of existing judge models via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases. In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Do LLM Evaluators Prefer Themselves for a Reason? [21.730128682888168]
Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models?
arXiv Detail & Related papers (2025-04-04T18:09:23Z)
- Rethinking Prompt-based Debiasing in Large Language Models [40.90578215191079]
Investigating bias in large language models (LLMs) is crucial for developing trustworthy AI. While prompt-based debiasing through prompt engineering is common, its effectiveness relies on the assumption that models inherently understand biases. Our study systematically analyzed this assumption using the BBQ and StereoSet benchmarks on both open-source models and commercial GPT models.
arXiv Detail & Related papers (2025-03-12T10:06:03Z)
- Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z)
- OffsetBias: Leveraging Debiased Data for Tuning Evaluators [1.5790747258969664]
We qualitatively identify six types of biases inherent in various judge models.
Fine-tuning on our dataset significantly enhances the robustness of judge models against biases.
arXiv Detail & Related papers (2024-07-09T05:16:22Z)
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models [15.781930031346105]
Self-reflection enhances performance in TruthfulQA, but adversely affects results in HotpotQA.
We find that self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Based on our findings, we propose guidelines for decisions on when to implement self-reflection.
arXiv Detail & Related papers (2024-04-14T02:47:32Z)
- Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning.
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
- Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z)