Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
- URL: http://arxiv.org/abs/2508.18076v2
- Date: Wed, 27 Aug 2025 21:24:13 GMT
- Title: Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
- Authors: Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi,
- Abstract summary: Large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored.<n>This paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators.
- Score: 23.16086453334644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aims to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible evaluation practices in LLJs evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
Related papers
- Learning to Judge: LLMs Designing and Applying Evaluation Rubrics [18.936553687978087]
Large language models (LLMs) are increasingly used as evaluators for natural language generation.<n>We introduce GER-Eval to investigate whether LLMs can design and apply their own evaluation rubrics.
arXiv Detail & Related papers (2026-02-09T13:56:06Z) - Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations.<n>Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation.<n>We propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications.
arXiv Detail & Related papers (2025-04-01T09:36:56Z) - A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis [1.5802986215292303]
We propose a novel benchmark that evaluates large language models (LLMs) using n-gram statistics and rules.<n>Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness.<n>Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources.
arXiv Detail & Related papers (2025-02-13T13:30:54Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.<n>It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.<n>We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework-CALM- which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus NLG-Eval with annotations from both human and GPT-4.
We also propose an LLM dedicated to NLG evaluation, which has been trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.<n>This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning)<n>We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.