When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
- URL: http://arxiv.org/abs/2509.20293v3
- Date: Wed, 08 Oct 2025 10:11:46 GMT
- Title: When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
- Authors: Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson
- Abstract summary: We argue that without tight objectives and verifiable constructions, benchmark pipelines can produce high-confidence rankings that are in fact largely noise. We show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware benchmarks.
- Score: 21.192000569821943
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth-based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark pipelines can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by its explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal-consistency and discriminant-validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We released our code and dataset at https://github.com/penfever/judgment-to-noise
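To make the schematic-adherence diagnostic concrete, here is a minimal sketch under an assumed ordinary-least-squares reading: regress each judge's overall verdicts on its own rubric criterion scores and report 1 − R² as unexplained variance. Function and variable names are illustrative; the authors' actual pipeline lives in the linked repository.

```python
# Sketch: how much of a judge's overall verdict does its own rubric explain?
import numpy as np

def unexplained_variance(criteria: np.ndarray, overall: np.ndarray) -> float:
    """criteria: (n_items, n_criteria) rubric scores; overall: (n_items,) verdicts."""
    X = np.column_stack([np.ones(len(overall)), criteria])  # add an intercept
    beta, *_ = np.linalg.lstsq(X, overall, rcond=None)      # OLS fit
    resid = overall - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((overall - overall.mean()) ** 2).sum())
    return ss_res / ss_tot                                  # = 1 - R^2

# A judge whose verdicts barely track its four rubric criteria
rng = np.random.default_rng(0)
crit = rng.normal(size=(200, 4))
verdict = 0.1 * crit[:, 0] + rng.normal(size=200)  # verdicts mostly off-rubric
print(f"unexplained variance: {unexplained_variance(crit, verdict):.2f}")
```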
Related papers
- C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning [0.6138671548064355]
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning. We introduce C2-Faith, a benchmark that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage. We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring.
arXiv Detail & Related papers (2026-03-05T13:36:47Z)
- Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models [55.94503936470247]
Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLM judges. Most classical methods assume annotators are conditionally independent given the true label $Y \in \{0,1\}$, an assumption often violated by LLM judges. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors.
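As a toy illustration of dependence-aware aggregation (an assumed formulation, not necessarily the paper's exact model), one can posit class-conditional Ising models over the judges' labels and aggregate via the Bayes posterior over the true label; for small $K$ the partition functions can be brute-forced:

```python
# Toy dependence-aware aggregation: judges' labels s_k in {-1,+1} follow a
# class-conditional Ising model with fields h and couplings J; aggregation is
# the Bayes posterior log-odds over the true label y. All names are assumed.
import itertools
import numpy as np

def log_partition(h: np.ndarray, J: np.ndarray) -> float:
    """log Z over all 2^K spin configurations (brute force; fine for small K)."""
    configs = (np.array(c) for c in itertools.product([-1, 1], repeat=len(h)))
    return float(np.logaddexp.reduce([s @ h + 0.5 * s @ J @ s for s in configs]))

def posterior_log_odds(s, h_pos, J_pos, h_neg, J_neg, log_prior_odds=0.0):
    """log P(y=+1 | s) - log P(y=-1 | s) under the two conditional models."""
    ll_pos = s @ h_pos + 0.5 * s @ J_pos @ s - log_partition(h_pos, J_pos)
    ll_neg = s @ h_neg + 0.5 * s @ J_neg @ s - log_partition(h_neg, J_neg)
    return log_prior_odds + ll_pos - ll_neg

# Judges 0 and 1 are strongly coupled when y=+1 (they tend to copy each other),
# so their agreement carries less independent evidence than judge 2's vote.
h = np.array([1.0, 1.0, 1.0])
J_pos, J_neg = np.zeros((3, 3)), np.zeros((3, 3))
J_pos[0, 1] = J_pos[1, 0] = 1.5
print(posterior_log_odds(np.array([1, 1, -1]), h, J_pos, -h, J_neg))
```

With all couplings set to zero this reduces to an independent weighted vote, which is the conditional-independence baseline the paper argues against.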
arXiv Detail & Related papers (2026-01-29T21:26:50Z)
- A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth [4.9467757325435775]
Evaluating large language models (LLMs) on open-ended tasks is increasingly done via the LLM-as-a-judge paradigm. Treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters.
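The extension is easy to state concretely. A minimal sketch, assuming the discrimination parameter enters as a per-judge scale on the usual Bradley-Terry-Luce logit (symbol names are assumptions, not the paper's notation):

```python
# P(i beats j | judge k) = sigmoid(a_k * (theta_i - theta_j)): a judge with
# discrimination a_k near 0 is close to a coin flip and carries little signal.
import numpy as np

def win_prob(theta_i: float, theta_j: float, a_k: float) -> float:
    return 1.0 / (1.0 + np.exp(-a_k * (theta_i - theta_j)))

print(win_prob(1.0, 0.0, a_k=2.0))  # discriminating judge: ~0.88
print(win_prob(1.0, 0.0, a_k=0.1))  # near-random judge:    ~0.52
```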
arXiv Detail & Related papers (2026-01-29T15:01:28Z)
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
- RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [15.787947727055611]
We introduce RULERS, a compiler-executor framework that transforms natural-language rubrics into executable specifications. RULERS operates by compiling criteria into versioned, immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration.
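The summary does not spell out the calibration step, but one lightweight, Wasserstein-flavored post-hoc correction can be sketched as one-dimensional quantile matching, which is the monotone map minimizing the 1-Wasserstein distance between two empirical score distributions. This is an illustrative stand-in, not RULERS's actual procedure:

```python
# Quantile matching: monotonically remap a judge's scores onto a reference
# distribution. In 1-D this is the W1-optimal transport between the two
# empirical distributions. Names and threshold-free design are illustrative.
import numpy as np

def quantile_calibrate(scores: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Replace each score with the reference value at the same quantile."""
    ranks = scores.argsort().argsort()       # rank of each score, 0..n-1
    q = ranks / max(len(scores) - 1, 1)      # quantile in [0, 1]
    return np.quantile(reference, q)

judge = np.array([6.0, 7.0, 7.5, 9.0])  # a generous judge
ref = np.array([2.0, 4.0, 5.0, 8.0])    # reference score distribution
print(quantile_calibrate(judge, ref))    # [2. 4. 5. 8.]
```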
arXiv Detail & Related papers (2026-01-13T15:31:42Z)
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
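Of the two inconsistency types, pairwise transitivity is straightforward to audit directly. A small sketch (the data layout is assumed, not TrustJudge's code) counts preference cycles among model triples:

```python
# Count intransitive triples (A > B, B > C, yet C > A) in a judge's pairwise
# preferences. `prefs` maps an ordered pair (a, b) to True when a is preferred.
from itertools import combinations

def intransitive_triples(prefs: dict) -> list:
    models = sorted({m for pair in prefs for m in pair})
    cycles = []
    for a, b, c in combinations(models, 3):
        # a 3-cycle can run in either of two orientations
        for x, y, z in ((a, b, c), (a, c, b)):
            if prefs.get((x, y)) and prefs.get((y, z)) and prefs.get((z, x)):
                cycles.append((x, y, z))
    return cycles

prefs = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
print(intransitive_triples(prefs))  # [('A', 'B', 'C')] -- one preference cycle
```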
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
- The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs [3.9977256267361754]
We present Nazonazo, a cost-effective benchmark built from Japanese children's riddles to test insight-based reasoning. No model except GPT-5 is comparable to human performance, which stands at a mean accuracy of 52.9%.
arXiv Detail & Related papers (2025-09-18T07:50:04Z)
- CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes the limitations of prior judge models via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
- Judging LLMs on a Simplex [2.088672652658465]
A common practice is to use large language models (LLMs) themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight into what is or is not identifiable.
arXiv Detail & Related papers (2025-05-28T04:50:41Z)
- Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges [3.168632659778101]
We highlight two critical challenges that are typically overlooked: (i) evaluations in the wild, where factors like prompt sensitivity and distribution shifts can affect performance, and (ii) adversarial attacks that target the judge. We show that small changes, such as the style of the model output, can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe.
arXiv Detail & Related papers (2025-03-06T14:24:12Z)
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably, which poses a significant challenge to their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
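A rough way to see the idea, as a simplified stand-in rather than the paper's exact estimator: sample several explanation-answer pairs for the same question and treat the concentration of the induced answer distribution as the confidence signal.

```python
# Entropy of the answer distribution induced by repeated sampled explanations.
# Lower entropy (more stable explanations/answers) suggests higher confidence.
from collections import Counter
import math

def answer_entropy(answers: list) -> float:
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())

print(answer_entropy(["B", "B", "B", "B", "A"]))  # ~0.50 nats: fairly stable
print(answer_entropy(["A", "B", "C", "D", "A"]))  # ~1.33 nats: unstable
```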
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification [31.392067805022414]
Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification.
In practice, the variance on some data examples is so large that decisions can be effectively arbitrary.
We develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary.
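The abstention rule can be sketched in a few lines; this is a minimal reading of the idea with assumed names, a hypothetical disagreement threshold, and scikit-learn base models:

```python
# Train an ensemble on bootstrap resamples; if members split too evenly on an
# example, the decision is effectively arbitrary, so the ensemble abstains.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bootstrap_ensemble(X, y, n_members=25, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        members.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return members

def predict_or_abstain(members, x, margin=0.2):
    votes = np.mean([m.predict(x.reshape(1, -1))[0] for m in members])
    if abs(votes - 0.5) < margin:
        return None  # abstain: trained models disagree too much on this example
    return int(votes > 0.5)
```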
arXiv Detail & Related papers (2023-01-27T06:52:04Z)