Related papers: LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations

URL: http://arxiv.org/abs/2504.19076v1
Date: Sun, 27 Apr 2025 02:14:21 GMT
Title: LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Authors: Laura Dietz, Oleg Zendel, Peter Bailey, Charles Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell
Abstract summary: Large Language Models (LLMs) are increasingly used to evaluate information systems. Recent studies suggest that LLM-based evaluations often align with human judgments. This paper examines scenarios where LLM-evaluators may falsely indicate success.
Score: 29.031539043555362
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) are increasingly used to evaluate information retrieval (IR) systems, generating relevance judgments traditionally made by human assessors. Recent empirical studies suggest that LLM-based evaluations often align with human judgments, leading some to suggest that human judges may no longer be necessary, while others highlight concerns about judgment reliability, validity, and long-term impact. As IR systems begin incorporating LLM-generated signals, evaluation outcomes risk becoming self-reinforcing, potentially leading to misleading conclusions. This paper examines scenarios where LLM-evaluators may falsely indicate success, particularly when LLM-based judgments influence both system development and evaluation. We highlight key risks, including bias reinforcement, reproducibility challenges, and inconsistencies in assessment methodologies. To address these concerns, we propose tests to quantify adverse effects, guardrails, and a collaborative framework for constructing reusable test collections that integrate LLM judgments responsibly. By providing perspectives from academia and industry, this work aims to establish best practices for the principled use of LLMs in IR evaluation.

Related papers

An Empirical Analysis of Uncertainty in Large Language Model Evaluations [28.297464655099034]
We conduct experiments involving 9 widely used LLM evaluators across 2 different evaluation settings. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. We find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent.
arXiv Detail & Related papers (2025-02-15T07:45:20Z)
LLM-based relevance assessment still can't replace human relevance assessment [12.829823535454505]
Recent studies suggest that large language models (LLMs) for relevance assessment in information retrieval provide comparable evaluations to human judgments. Upadhyay et al. claim that LLM-based relevance assessments can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion.
arXiv Detail & Related papers (2024-12-22T20:45:15Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates [11.948519516797745]
We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges. Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.
arXiv Detail & Related papers (2024-08-23T11:49:01Z)
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations [35.12731651234186]
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities. We systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
arXiv Detail & Related papers (2024-07-04T17:15:37Z)
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom [19.104850413126066]
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers. We propose FedEval-LLM that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools.
arXiv Detail & Related papers (2024-04-18T15:46:26Z)
Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry. This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.