Beyond the Surface: Measuring Self-Preference in LLM Judgments
- URL: http://arxiv.org/abs/2506.02592v1
- Date: Tue, 03 Jun 2025 08:12:47 GMT
- Title: Beyond the Surface: Measuring Self-Preference in LLM Judgments
- Authors: Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
- Abstract summary: Studies show that large language models (LLMs) exhibit self-preference bias when serving as judges. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. We propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments.
- Score: 35.66285592603435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
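To make the distinction concrete, below is a minimal, hypothetical Python sketch contrasting the conventional own-versus-other score gap with a DBG-style gap against gold judgments. The function names, scoring scale, and aggregation by a simple mean are illustrative assumptions derived from the abstract, not the paper's exact formulation or released code.

```python
# Hypothetical sketch (not the authors' released code): contrasting the
# conventional self-preference measurement with a DBG-style measurement.
from statistics import mean
from typing import Sequence

def conventional_gap(scores_own: Sequence[float],
                     scores_other: Sequence[float]) -> float:
    """Prior approach: the judge's average score for its own responses minus
    its average score for other models' responses. As the abstract notes,
    this conflates self-preference bias with genuine quality differences."""
    return mean(scores_own) - mean(scores_other)

def dbg_style_gap(judge_scores_own: Sequence[float],
                  gold_judgments_own: Sequence[float]) -> float:
    """DBG-style estimate: the judge's scores for its own responses minus the
    gold judgments of those same responses, so true response quality is
    absorbed by the gold reference and only the bias remains."""
    if len(judge_scores_own) != len(gold_judgments_own):
        raise ValueError("need one gold judgment per judged response")
    return mean(j - g for j, g in zip(judge_scores_own, gold_judgments_own))

# Toy numbers: the judge rates its own responses 9, 8, 9 on a 1-10 scale,
# while gold judgments of the same responses are 8, 8, 7.
print(dbg_style_gap([9, 8, 9], [8, 8, 7]))  # 1.0 -> positive self-preference
```

Under this framing, a DBG-style gap near zero would indicate little self-preference bias even when the judge's own responses legitimately score higher than other models' responses.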
Related papers
- CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes the limitations of existing judge models via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases. In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Do LLM Evaluators Prefer Themselves for a Reason? [21.730128682888168]
Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models?
arXiv Detail & Related papers (2025-04-04T18:09:23Z)
- Rethinking Prompt-based Debiasing in Large Language Models [40.90578215191079]
Investigating bias in large language models (LLMs) is crucial for developing trustworthy AI. While prompt-based debiasing through prompt engineering is common, its effectiveness relies on the assumption that models inherently understand biases. Our study systematically analyzed this assumption using the BBQ and StereoSet benchmarks on both open-source models and commercial GPT models.
arXiv Detail & Related papers (2025-03-12T10:06:03Z)
- Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z)
- OffsetBias: Leveraging Debiased Data for Tuning Evaluators [1.5790747258969664]
We qualitatively identify six types of biases inherent in various judge models.
Fine-tuning on our dataset significantly enhances the robustness of judge models against biases.
arXiv Detail & Related papers (2024-07-09T05:16:22Z)
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models [15.781930031346105]
Self-reflection enhances performance in TruthfulQA, but adversely affects results in HotpotQA.
We find that self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Based on our findings, we propose guidelines for decisions on when to implement self-reflection.
arXiv Detail & Related papers (2024-04-14T02:47:32Z)
- Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning.
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
- Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z)