Who can we trust? LLM-as-a-jury for Comparative Assessment
- URL: http://arxiv.org/abs/2602.16610v1
- Date: Wed, 18 Feb 2026 17:04:02 GMT
- Title: Who can we trust? LLM-as-a-jury for Comparative Assessment
- Authors: Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill
- Abstract summary: Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment. LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. We propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone.
- Score: 42.32900791516691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
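The idea of jointly inferring item scores and judge reliability from pairwise comparisons can be illustrated with a minimal sketch. The abstract does not give the exact form of BT-sigma, so the parameterization below is an assumption: judge k prefers item i over item j with probability sigmoid(a_k * (s_i - s_j)), where s holds item scores and a_k > 0 is a per-judge discriminator. The function name `fit_judge_aware_bt`, the gradient-ascent fitting, the normalization, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_judge_aware_bt(comparisons, n_items, n_judges, lr=0.5, steps=2000):
    """Fit item scores s and per-judge discriminators a by maximum likelihood.

    comparisons: iterable of (judge, item_i, item_j, y), where y in [0, 1]
    is the (hard or soft) outcome "item_i beats item_j".
    Assumed model (not confirmed by the source):
        P(i beats j | judge k) = sigmoid(a_k * (s_i - s_j))
    """
    k, i, j, y = (np.asarray(c) for c in zip(*comparisons))
    y = y.astype(float)
    s = np.zeros(n_items)   # item scores
    b = np.zeros(n_judges)  # log-discriminators, so a = exp(b) stays positive
    for _ in range(steps):
        a = np.exp(b)
        z = a[k] * (s[i] - s[j])
        err = y - 1.0 / (1.0 + np.exp(-z))  # d(log-likelihood)/dz
        grad_s = np.zeros(n_items)
        np.add.at(grad_s, i, err * a[k])
        np.add.at(grad_s, j, -err * a[k])
        grad_b = np.zeros(n_judges)
        np.add.at(grad_b, k, err * z)       # dz/db_k = z
        s += lr * grad_s / len(y)
        b += lr * grad_b / len(y)
        # Remove the shift and scale ambiguities: zero-mean scores and
        # zero-mean log-discriminators (the products a_k * s are unchanged).
        s -= s.mean()
        shift = b.mean()
        b -= shift
        s *= np.exp(shift)
    return s, np.exp(b)

# Synthetic jury: judge 0 is sharp, judge 1 is close to random guessing.
rng = np.random.default_rng(0)
true_scores = np.linspace(-1.0, 1.0, 5)
true_disc = np.array([3.0, 0.3])
comparisons = []
for judge in range(2):
    for i in range(5):
        for j in range(i + 1, 5):
            p = 1.0 / (1.0 + np.exp(-true_disc[judge] * (true_scores[i] - true_scores[j])))
            for _ in range(200):
                comparisons.append((judge, i, j, float(rng.random() < p)))

scores, discriminators = fit_judge_aware_bt(comparisons, n_items=5, n_judges=2)
print("estimated item ranking (worst to best):", np.argsort(scores))
print("estimated discriminators:", discriminators)
```

In this sketch the noisy judge receives a smaller discriminator, so its comparisons contribute less to the item scores than they would under plain averaging. No human labels are used, which is the sense in which such a model could act as unsupervised calibration.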
Related papers
- Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation [20.16938320120462]
M-JudgeBench is a capability-oriented benchmark to comprehensively assess the judgment abilities of MLLMs. Judge-MCTS is a data construction framework that generates pairwise reasoning trajectories of varying correctness and length. Our work establishes a more principled foundation for evaluating MLLM-as-a-judge through the M-JudgeBench benchmark and the Judge-MCTS framework.
arXiv Detail & Related papers (2026-02-28T08:49:22Z)
- A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth [4.9467757325435775]
Evaluating large language models (LLMs) on open-ended tasks is increasingly done via the LLM-as-a-judge paradigm. Treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters.
arXiv Detail & Related papers (2026-01-29T15:01:28Z)
- JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation [13.831735556002426]
Small language models (SLMs) have shown promise on various reasoning tasks. Their ability to judge the correctness of answers remains unclear compared to large language models (LLMs).
arXiv Detail & Related papers (2025-11-20T01:14:39Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
- Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge [17.40713507922006]
Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other outputs. LLMs may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias. We present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated.
arXiv Detail & Related papers (2025-08-08T21:22:12Z)
- Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
- Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z)
- Verdict: A Library for Scaling Judge-Time Compute [5.468405526095168]
Verdict is an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict achieves performance competitive with orders-of-magnitude larger fine-tuned judges.
arXiv Detail & Related papers (2025-02-25T09:26:44Z)
- Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates [11.948519516797745]
We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges. Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.
arXiv Detail & Related papers (2024-08-23T11:49:01Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.