When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
- URL: http://arxiv.org/abs/2602.17170v2
- Date: Fri, 20 Feb 2026 23:31:40 GMT
- Title: When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
- Authors: Chuting Yu, Hang Li, Guido Zuccon, Joel Mackenzie, Teerapong Leelanupab,
- Abstract summary: Large language models (LLMs) can be used as proxies for human judges. We show that models consistently assign inflated relevance scores to passages that do not genuinely satisfy the underlying information need. Experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues.
- Score: 29.603396943658428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores -- often with high confidence -- to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
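The abstract describes two judging paradigms (pointwise and pairwise) and controlled passage modifications. Below is a minimal, hypothetical sketch of what such a probe can look like; the prompts, the 0-3 grading scale, and the `call_llm` helper are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of pointwise/pairwise LLM relevance judging plus a simple
# length perturbation. `call_llm` is a placeholder for any chat-completion client;
# the prompts and the 0-3 scale are assumptions for illustration only.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text response."""
    raise NotImplementedError

def pointwise_judgment(query: str, passage: str) -> int:
    """Ask the LLM for a graded relevance label (0 = irrelevant, 3 = perfectly relevant)."""
    prompt = (
        "Rate how well the passage satisfies the information need of the query "
        "on a scale from 0 (irrelevant) to 3 (perfectly relevant). "
        f"Query: {query} Passage: {passage} Answer with a single digit."
    )
    return int(call_llm(prompt).strip()[0])

def pairwise_judgment(query: str, passage_a: str, passage_b: str) -> str:
    """Ask the LLM which of two passages better satisfies the query ('A' or 'B')."""
    prompt = (
        "Which passage better satisfies the information need of the query? "
        f"Query: {query} Passage A: {passage_a} Passage B: {passage_b} "
        "Answer with 'A' or 'B'."
    )
    return call_llm(prompt).strip()[0].upper()

def pad_with_filler(passage: str, n_repeats: int = 5) -> str:
    """One crude passage modification: append off-topic filler so that any score
    increase can only reflect length, not added relevance."""
    return passage + " Unrelated filler sentence." * n_repeats
```

Comparing `pointwise_judgment(q, p)` with `pointwise_judgment(q, pad_with_filler(p))` over many query-passage pairs is one way to surface the kind of length sensitivity the paper reports.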
Related papers
- Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis [4.719505127252616]
Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation. We aim to understand whether LLMs make systematic mistakes when judging relevance, rather than just how good they are on average. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space (a minimal sketch of this kind of analysis appears after this list).
arXiv Detail & Related papers (2026-01-05T03:02:33Z)
- On Evaluating LLM Alignment by Evaluating LLMs as Judges [68.15541137648721]
Evaluating the alignment of large language models (LLMs) requires them to be helpful, honest, safe, and to follow human instructions precisely. We examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. We propose a benchmark that assesses alignment without directly evaluating model outputs.
arXiv Detail & Related papers (2025-11-25T18:33:24Z)
- Real-World Summarization: When Evaluation Reaches Its Limits [1.4197924572122094]
We compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics such as word overlap correlate surprisingly well with human judgments. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks.
arXiv Detail & Related papers (2025-07-15T17:23:56Z)
- LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations [29.031539043555362]
Large Language Models (LLMs) are increasingly used to evaluate information systems. Recent studies suggest that LLM-based evaluations often align with human judgments. This paper examines scenarios where LLM-evaluators may falsely indicate success.
arXiv Detail & Related papers (2025-04-27T02:14:21Z)
- LLM-based relevance assessment still can't replace human relevance assessment [12.829823535454505]
Recent studies suggest that using large language models (LLMs) for relevance assessment in information retrieval provides evaluations comparable to human judgments. Upadhyay et al. claim that LLM-based relevance assessments can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion.
arXiv Detail & Related papers (2024-12-22T20:45:15Z)
- Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation [2.9180406633632523]
Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield system rankings that correlate highly with those from human-made judgements. We examine how well LLM-generated judgements preserve ranking differences among top-performing systems, and whether they preserve pairwise significance outcomes in the same way human judgements do (a minimal sketch of such a ranking-correlation check appears after this list).
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM judges in many domains, their potential issues remain under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose CALM, a new automated bias quantification framework that quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data. We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations (a minimal agreement-measurement sketch appears after this list).
arXiv Detail & Related papers (2024-06-26T14:56:13Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity. To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risks and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
Large language models (LLMs) themselves claim that they can assist with relevance judgments, but it is not yet clear whether such automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
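Referenced from the query-document dense-vector entry above: a minimal sketch of clustering query-document pairs in a joint embedding space so that LLM-vs-human disagreement can be inspected per cluster. The encoder name and the use of plain k-means are assumptions for illustration, not the framework proposed in that paper.

```python
# Hedged sketch: embed (query, document) pairs jointly and cluster them.
# The encoder and k-means are illustrative choices, not the cited paper's exact method.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def cluster_qd_pairs(pairs, n_clusters=10):
    """`pairs` is a list of (query, document) strings; returns one cluster id per pair."""
    texts = [f"{query} [SEP] {doc}" for query, doc in pairs]
    vectors = encoder.encode(texts, normalize_embeddings=True)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
```

Per-cluster rates of LLM-human label disagreement can then reveal whether judging mistakes concentrate in particular regions of the semantic space rather than being spread uniformly.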
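Referenced from the fair-and-reliable retrieval-evaluation entry above: a minimal sketch of checking whether LLM-based judgments preserve the system ranking produced by human judgments. Kendall's tau is a common choice for this comparison; the exact statistic used in that paper is not assumed here.

```python
# Hedged sketch: correlate system rankings under human qrels vs. LLM qrels.
from scipy.stats import kendalltau

def ranking_correlation(human_scores: dict, llm_scores: dict) -> float:
    """Both arguments map system name -> effectiveness (e.g., nDCG@10) computed
    with human-made and LLM-made judgments respectively."""
    systems = sorted(human_scores)
    tau, _ = kendalltau([human_scores[s] for s in systems],
                        [llm_scores[s] for s in systems])
    return tau
```

A high tau says the two sets of judgments order systems similarly, but it does not by itself guarantee that pairwise significance tests between systems agree as well.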
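Referenced from the JUDGE-BENCH entry above: a minimal sketch of measuring how well an LLM replicates human annotations across several labeled datasets. Cohen's kappa is one common chance-corrected agreement statistic; it is not necessarily the one used by that benchmark.

```python
# Hedged sketch: chance-corrected agreement between LLM and human labels,
# computed per dataset and then averaged.
from sklearn.metrics import cohen_kappa_score

def mean_agreement(datasets):
    """`datasets` maps dataset name -> (human_labels, llm_labels), two equal-length lists."""
    kappas = {name: cohen_kappa_score(human, llm)
              for name, (human, llm) in datasets.items()}
    return kappas, sum(kappas.values()) / len(kappas)
```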