Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
- URL: http://arxiv.org/abs/2503.05965v2
- Date: Tue, 11 Mar 2025 21:21:35 GMT
- Title: Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
- Authors: Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
- Abstract summary: We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches.
- Score: 70.73669293078916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.
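As a concrete illustration of the validation protocol described in the abstract, the sketch below (my own minimal example, not code from the paper; all names and numbers are invented) collapses multiple human ratings per item into a majority-vote gold label and scores a judge system by its agreement rate with those labels, then contrasts that with a distribution-aware alternative that keeps every rater's vote. The contrast shows how a forced gold label can hide principled rater disagreement, which is the setting the paper's framework targets.

```python
from collections import Counter

def majority_gold_label(ratings):
    """Collapse one item's human ratings into a single gold label by majority vote."""
    # Ties are broken arbitrarily, which already hints at the problem:
    # for genuinely ambiguous items there is no principled gold label.
    return Counter(ratings).most_common(1)[0][0]

def gold_label_agreement(judge_ratings, human_ratings_per_item):
    """Standard validation: fraction of items where the judge matches the gold label."""
    matches = [
        judge == majority_gold_label(humans)
        for judge, humans in zip(judge_ratings, human_ratings_per_item)
    ]
    return sum(matches) / len(matches)

def rater_agreement(judge_ratings, human_ratings_per_item):
    """Distribution-aware alternative (my illustration, not necessarily the paper's
    proposal): average fraction of individual human raters the judge agrees with,
    without forcing a single gold label per item."""
    per_item = [
        sum(judge == h for h in humans) / len(humans)
        for judge, humans in zip(judge_ratings, human_ratings_per_item)
    ]
    return sum(per_item) / len(per_item)

# Toy validation corpus: 4 items, 3 human raters each, binary "acceptable?" ratings.
human_ratings = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
judge = [1, 1, 1, 0]

print(gold_label_agreement(judge, human_ratings))  # 1.0: perfect by the gold-label view
print(rater_agreement(judge, human_ratings))       # 0.75: rater disagreement is now visible
```

The same judge looks flawless under gold-label agreement yet noticeably imperfect once rater disagreement is retained, which is the kind of gap between evaluation schemes the paper analyzes.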
Related papers
- Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs).
Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation.
This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z)
- Tuning LLM Judge Design Decisions for 1/1000 of the Cost [42.06346155380305]
Evaluating Large Language Models (LLMs) often requires costly human annotations.
To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs.
While several judge designs have been proposed, many confounding factors differ across papers.
arXiv Detail & Related papers (2025-01-24T17:01:14Z)
- JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment [28.4353755578306]
Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks.
We introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments.
arXiv Detail & Related papers (2024-12-17T19:04:15Z)
- JuStRank: Benchmarking LLM Judges for System Ranking [7.507819077549208]
We conduct the first large-scale study of LLM judges as system rankers.
System scores are generated by aggregating judgment scores over multiple system outputs (a minimal sketch of this aggregation step appears after this entry).
Our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
arXiv Detail & Related papers (2024-12-12T18:51:13Z)
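A minimal sketch of system-level aggregation of judge scores, as referenced in the entry above (my own illustration, not code from JuStRank; the mean is just one possible aggregator, and all names and scores are invented):

```python
from statistics import mean

def system_scores(judgments):
    """Aggregate per-output judge scores into one score per system (mean here)."""
    return {system: mean(scores) for system, scores in judgments.items()}

def rank_systems(judgments):
    """Order systems from best to worst by their aggregated judge score."""
    scores = system_scores(judgments)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: judge scores (1-5 scale) for several outputs from each system.
judgments = {
    "system_a": [4, 5, 4, 3],
    "system_b": [3, 3, 4, 4],
    "system_c": [5, 4, 5, 5],
}
print(rank_systems(judgments))  # ['system_c', 'system_a', 'system_b']
```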
- Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation [2.9180406633632523]
Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment.
Recent research has shown that LLM-based assessments yield high system-ranking correlation with human-made judgements.
We examine how well LLM-generated judgements preserve ranking differences among top-performing systems, and whether they preserve pairwise significance outcomes, compared with human judgements (see the ranking-correlation sketch after this entry).
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
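A minimal sketch of the ranking-correlation check mentioned in the entry above, using Kendall's tau from SciPy (the run names and per-system scores are invented for illustration; this is not code or data from the paper):

```python
from scipy.stats import kendalltau

# Hypothetical mean effectiveness per retrieval system, computed once with
# human relevance judgements and once with LLM-generated judgements.
systems = ["run_a", "run_b", "run_c", "run_d", "run_e"]
human_scores = [0.41, 0.38, 0.35, 0.33, 0.30]
llm_scores = [0.44, 0.36, 0.37, 0.31, 0.29]

# Rank correlation between the two induced system orderings.
tau, p_value = kendalltau(human_scores, llm_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")

# A high tau can coexist with swapped neighbours near the top of the ranking,
# which is exactly the regime the paper above scrutinises.
human_order = sorted(systems, key=dict(zip(systems, human_scores)).get, reverse=True)
llm_order = sorted(systems, key=dict(zip(systems, llm_scores)).get, reverse=True)
print(human_order)
print(llm_order)
```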
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.
Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much influence prompting LLMs-as-a-judge has on the alignment of AI judgements with human judgements.
We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z)
- Quality-Based Conditional Processing in Multi-Biometrics: Application to Sensor Interoperability [63.05238390013457]
We describe and evaluate the ATVS-UAM fusion approach submitted to the quality-based evaluation of the 2007 BioSecure Multimodal Evaluation Campaign.
Our approach is based on linear logistic regression, in which fused scores tend to be log-likelihood ratios (a generic sketch of this kind of score fusion appears after this entry).
Results show that the proposed approach outperforms all the rule-based fusion schemes.
arXiv Detail & Related papers (2022-11-24T12:11:22Z)
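A generic sketch of linear-logistic-regression score fusion, as referenced in the entry above (not the ATVS-UAM system itself; purely illustrative, on synthetic scores):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for multi-biometric score fusion: each row holds the matching
# scores from two subsystems for one comparison; the label marks a genuine match.
rng = np.random.default_rng(0)
genuine = rng.normal(loc=[2.0, 1.5], scale=0.8, size=(200, 2))
impostor = rng.normal(loc=[0.0, 0.0], scale=0.8, size=(200, 2))
scores = np.vstack([genuine, impostor])
labels = np.concatenate([np.ones(200), np.zeros(200)])

# Linear logistic regression fusion: the fused score is a weighted sum of the
# per-subsystem scores plus a bias, trained so that the decision function
# approximates a log-likelihood ratio between genuine and impostor classes.
fuser = LogisticRegression().fit(scores, labels)
fused_llr = fuser.decision_function(scores)  # higher = more likely a genuine match
print(fuser.coef_, fuser.intercept_)
print(fused_llr[:3])
```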
- Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.