Related papers: CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

URL: http://arxiv.org/abs/2603.01865v1
Date: Mon, 02 Mar 2026 13:46:32 GMT
Title: CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Authors: Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen,
Abstract summary: This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy.
Score: 6.3121191919394475
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
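The round-robin idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name cyclic_assignment, the judge names, the additive offset model, and all numbers are hypothetical and chosen only so the arithmetic is easy to follow. The sketch assumes each judge adds a fixed offset to a response's true quality and that the offsets sum to zero across the judge pool.

```python
from collections import Counter
from statistics import mean


def cyclic_assignment(num_items, judges):
    """Assign exactly one judge to each item, cycling through the judge list."""
    return [judges[i % len(judges)] for i in range(num_items)]


# Hypothetical additive judge offsets (assumption: they sum to zero over the pool).
JUDGE_OFFSET = {"judge_a": 1, "judge_b": -1, "judge_c": 0}
# Hypothetical true quality shared by every response, to isolate the judge effect.
TRUE_QUALITY = 7


def score(judge):
    """Toy scoring model: true quality plus the judge's fixed offset."""
    return TRUE_QUALITY + JUDGE_OFFSET[judge]


judges = list(JUDGE_OFFSET)
num_items = 6  # two complete cycles over three judges

cyclic = cyclic_assignment(num_items, judges)
single = ["judge_a"] * num_items  # single-judge baseline, same number of calls

cyclic_mean = mean(score(j) for j in cyclic)
single_mean = mean(score(j) for j in single)

print(Counter(cyclic))           # how often each judge was used
print(cyclic_mean, single_mean)  # benchmark mean under each assignment
```

In this toy setting both assignments make six judge calls, but the cyclic mean equals the true quality of 7 while the single-judge mean is shifted to 8 by that judge's offset. The cancellation is exact here only because the number of items is a whole multiple of the number of judges and the offsets are assumed to be purely additive.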

Related papers

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
Who can we trust? LLM-as-a-jury for Comparative Assessment [42.32900791516691]
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment. LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. We propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone.
arXiv Detail & Related papers (2026-02-18T17:04:02Z)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge [10.584937371987742]
Existing LLM-as-a-Judge systems suffer from limited adaptivity to task- and domain-specific evaluation criteria. We propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge.
arXiv Detail & Related papers (2026-02-06T11:35:32Z)
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes limitations via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
Evaluating Scoring Bias in LLM-as-a-Judge [8.67484421243584]
Large Language Models (LLMs) are employed as evaluators for complex tasks. There are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments.
arXiv Detail & Related papers (2025-06-27T15:25:23Z)
Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.83088028268318]
This paper introduces the Judge Evaluation for Test-Time Scaling benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures.
arXiv Detail & Related papers (2025-04-21T17:33:23Z)
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z)
JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.