Do LLM Evaluators Prefer Themselves for a Reason?
- URL: http://arxiv.org/abs/2504.03846v2
- Date: Fri, 31 Oct 2025 18:58:47 GMT
- Title: Do LLM Evaluators Prefer Themselves for a Reason?
- Authors: Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, Yu Meng,
- Abstract summary: Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement.<n>Prior work highlights a potential self-preference bias where LLMs favor their own generated responses.<n>This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models?
- Score: 23.007963281858792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models? Answering this has been difficult as previous studies relied primarily on subjective tasks. These tasks lack an objective ground truth, meaning that either preference can be reasonably justified. To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating stronger models prefer themselves mostly legitimately. (2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference when they do err. This suggests stronger models struggle more to recognize when they are wrong. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
Related papers
- Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations [3.262230127283452]
We show that evaluators may deliver self-preferring verdicts when the judge responds to queries which they completed incorrectly themselves.<n>We introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model.
arXiv Detail & Related papers (2026-01-30T04:38:18Z) - Teaching Large Reasoning Models Effective Reflection [62.73646680747003]
Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks.<n>However, not all reflections are beneficial-many are superficial, offering little to no improvement over the original answer.<n>We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques.
arXiv Detail & Related papers (2026-01-19T04:51:53Z) - Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability.<n>We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
arXiv Detail & Related papers (2025-09-27T10:37:11Z) - Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators [3.07869141026886]
"Self-preference bias" undermines fairness and reliability in evaluation pipelines.<n>We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining.
arXiv Detail & Related papers (2025-09-03T18:52:55Z) - One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models.<n>We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.
We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.
Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Self-Preference Bias in LLM-as-a-Judge [13.880151307013321]
We introduce a novel metric to measure the self-preference bias in large language models (LLMs)
Our results show GPT-4 exhibits a significant degree of self-preference bias.
This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
arXiv Detail & Related papers (2024-10-29T07:42:18Z) - Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs.<n>In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process.<n>We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure.
arXiv Detail & Related papers (2024-10-14T01:57:25Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - LLM Evaluators Recognize and Favor Their Own Generations [33.672365386365236]
We investigate if self-recognition capability contributes to self-preference.
We find a linear correlation between self-recognition capability and the strength of self-preference bias.
We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.
arXiv Detail & Related papers (2024-04-15T16:49:59Z) - When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models [15.781930031346105]
Self-reflection enhances performance in TruthfulQA, but adversely affects results in HotpotQA.
We find that self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Based on our findings, we propose guidelines for decisions on when to implement self-reflection.
arXiv Detail & Related papers (2024-04-14T02:47:32Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrade on others.
We formally define LLM's self-bias - the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Human Feedback is not Gold Standard [28.63384327791185]
We critically analyse the use of human feedback for both training and evaluation.
We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
arXiv Detail & Related papers (2023-09-28T11:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.