Related papers: TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

URL: http://arxiv.org/abs/2509.21117v2
Date: Fri, 26 Sep 2025 05:33:48 GMT
Title: TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Authors: Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang,
Abstract summary: Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks.<n>We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency.<n>We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
Score: 58.04324690859212
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C\neq A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.

Related papers

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning [0.6138671548064355]
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning.<n>We introduce C2-Faith, a benchmark that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage.<n>We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring.
arXiv Detail & Related papers (2026-03-05T13:36:47Z)
Who can we trust? LLM-as-a-jury for Comparative Assessment [42.32900791516691]
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment.<n>LLMs judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent.<n>We propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone.
arXiv Detail & Related papers (2026-02-18T17:04:02Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge [5.855996386998925]
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level.<n>We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item.
arXiv Detail & Related papers (2025-12-02T18:46:47Z)
Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses.<n>We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals.<n>Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation [40.06592175227558]
This paper investigates a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts.<n>We find that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations.<n>Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications.
arXiv Detail & Related papers (2025-09-15T19:20:21Z)
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes limitations via a task-driven, multi-domain data curation strategy.<n> CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
Judging LLMs on a Simplex [2.088672652658465]
A common practice is to use large language models (LLMs) themselves as judges, but the theoretical properties of this approach are not yet well understood.<n>We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable.
arXiv Detail & Related papers (2025-05-28T04:50:41Z)
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases.<n>We find that generator models can flip preferences by embedding distractor features.<n>We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation.<n>McQCA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z)
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
textbfJudger-1 is the first open-source textbfall-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. textbfJudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
LLMs Can Patch Up Missing Relevance Judgments in Evaluation [56.51461892988846]
We use large language models (LLMs) to automatically label unjudged documents. We simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgment in TREC DL tracks. Our method achieves a Kendall tau correlation of 0.87 and 0.92 on an average for Vicuna-7B and GPT-3.5 Turbo respectively.
arXiv Detail & Related papers (2024-05-08T00:32:19Z)
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists [12.542045913426639]
CheckEval is a checklist-based evaluation framework that improves rating reliability via binary questions.<n>CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of Large Language Models (LLMs) generated texts. We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z)
Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification. We propose a risk-consistent approach to tackle this problem and show that the estimation error bound the optimal convergence rate. We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.