Validating LLM-as-a-Judge Systems under Rating Indeterminacy
- URL: http://arxiv.org/abs/2503.05965v4
- Date: Mon, 27 Oct 2025 16:18:11 GMT
- Title: Validating LLM-as-a-Judge Systems under Rating Indeterminacy
- Authors: Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
- Abstract summary: We introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy when responding to forced-choice rating instructions can heavily bias validation.
- Score: 65.137380612741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human--judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct." We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human--judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy when responding to forced-choice rating instructions can heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 9 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 31% worse than judge systems selected by our approach that uses multi-label "response set" ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.
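To make the contrast in the abstract concrete, here is a minimal sketch of the two validation views it describes: scoring a judge against forced-choice gold labels versus against multi-label "response set" ratings. The majority-vote aggregation, the 50% support threshold, and all names below are illustrative assumptions, not the paper's actual agreement metrics or estimators.

```python
from collections import Counter
from typing import Dict, List, Set


def forced_choice_agreement(human_ratings: Dict[str, List[str]],
                            judge_ratings: Dict[str, str]) -> float:
    """Judge hit rate against per-item gold labels obtained by majority-voting
    forced-choice human ratings (ties broken arbitrarily by Counter order)."""
    hits = sum(
        judge_ratings[item] == Counter(ratings).most_common(1)[0][0]
        for item, ratings in human_ratings.items()
    )
    return hits / len(human_ratings)


def response_set_agreement(human_response_sets: Dict[str, List[Set[str]]],
                           judge_ratings: Dict[str, str]) -> float:
    """Fraction of items whose judge rating lies inside at least half of the
    raters' multi-label response sets (one illustrative way to credit any
    rating a rater deemed 'reasonable')."""
    hits = 0
    for item, rater_sets in human_response_sets.items():
        support = sum(judge_ratings[item] in s for s in rater_sets) / len(rater_sets)
        hits += support >= 0.5
    return hits / len(human_response_sets)


# Toy item where forced-choice aggregation hides indeterminacy: two of the
# three raters would also accept "fail", so the judge's "fail" is defensible
# even though it disagrees with the majority-vote gold label.
human_forced = {"item1": ["pass", "pass", "fail"]}
human_sets = {"item1": [{"pass"}, {"pass", "fail"}, {"fail"}]}
judge = {"item1": "fail"}

print(forced_choice_agreement(human_forced, judge))   # 0.0
print(response_set_agreement(human_sets, judge))      # 1.0
```

On the toy item, the forced-choice gold label penalizes the judge even though most raters would also accept its rating, while the response-set view credits it; this divergence is the kind of bias the paper argues can distort judge selection.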
Related papers
- Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems [2.9141470183751674]
We propose a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines.
arXiv Detail & Related papers (2025-12-01T15:26:20Z)
- Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems [32.83708359216193]
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems. This paper systematically investigates judgment biases in two LLM-as-a-judge models under the point-wise scoring setting. We propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
arXiv Detail & Related papers (2025-10-14T12:52:29Z)
- Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges [22.7340872046127]
We propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment.
arXiv Detail & Related papers (2025-08-01T09:26:01Z)
- Quantitative LLM Judges [48.676042957523045]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
- MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered [2.8692611791027893]
We present MALIBU, a novel benchmark developed to assess the degree to which multi-agent systems implicitly reinforce social biases and stereotypes. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality.
arXiv Detail & Related papers (2025-04-10T19:16:40Z)
- Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z)
- Tuning LLM Judge Design Decisions for 1/1000 of the Cost [42.06346155380305]
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs. While several approaches have been proposed, many confounding factors vary between different papers.
arXiv Detail & Related papers (2025-01-24T17:01:14Z)
- Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference [63.03859517284341]
An automatic evaluation framework aims to rank LLMs based on their alignment with human preferences. An automatic LLM bencher consists of four components: the input set, the evaluation model, the evaluation type, and the aggregation method.
arXiv Detail & Related papers (2024-12-31T17:46:51Z)
- JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment [28.4353755578306]
Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks. We introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments.
arXiv Detail & Related papers (2024-12-17T19:04:15Z)
- JuStRank: Benchmarking LLM Judges for System Ranking [7.507819077549208]
We conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs. Our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
arXiv Detail & Related papers (2024-12-12T18:51:13Z)
- Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation [2.9180406633632523]
Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment.
Recent research has shown that LLM-based assessments yield high system-ranking correlation with human-made judgements.
We examine how well LLM-generated judgements preserve ranking differences among top-performing systems and whether they support the same pairwise significance conclusions as human judgements.
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.
Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much the prompting of LLMs-as-a-judge influences the alignment of AI judgements with human judgements.
We compile a taxonomy of quality criteria commonly used across state-of-the-art LLM-based evaluations and provide it as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric (a rough prompt sketch appears after this list).
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Quality-Based Conditional Processing in Multi-Biometrics: Application to Sensor Interoperability [63.05238390013457]
We describe and evaluate the ATVS-UAM fusion approach submitted to the quality-based evaluation of the 2007 BioSecure Multimodal Evaluation Campaign.
Our approach is based on linear logistic regression, in which fused scores tend to be log-likelihood ratios.
Results show that the proposed approach outperforms all the rule-based fusion schemes.
arXiv Detail & Related papers (2022-11-24T12:11:22Z)
- Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
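As a rough illustration of the rubric-anchored rescaling described in the "Using Natural Language Explanations to Rescale Human Judgments" entry above: the rubric bands, prompt wording, and `call_llm` wrapper below are assumptions made for this sketch, not the authors' implementation.

```python
from typing import Callable

# Hypothetical rubric; the original work anchors scores in a task-specific rubric.
RUBRIC = (
    "Score the response from 0 to 100: 0-25 unacceptable, 26-50 poor, "
    "51-75 acceptable, 76-100 excellent."
)


def rescale_judgment(likert_rating: int,
                     explanation: str,
                     call_llm: Callable[[str], str]) -> int:
    """Map a raw 1-5 Likert rating plus the annotator's free-text explanation
    onto a rubric-anchored 0-100 score via a single LLM call."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"An annotator rated a response {likert_rating} on a 1-5 Likert scale and "
        f"explained: \"{explanation}\"\n\n"
        "Using the rubric above, reply with a single integer from 0 to 100."
    )
    return int(call_llm(prompt).strip())


# Usage with any wrapper that maps a prompt string to the model's reply string:
# score = rescale_judgment(4, "Mostly correct but misses one edge case.", my_llm)
```

Any chat-completion wrapper that turns a prompt string into a reply string can be passed as `call_llm`; the point of the sketch is that the annotator's explanation, not just the raw Likert value, informs the rescaled score.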
This list is automatically generated from the titles and abstracts of the papers on this site.