Related papers: UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

URL: http://arxiv.org/abs/2508.09724v1
Date: Wed, 13 Aug 2025 11:41:01 GMT
Title: UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge
Authors: Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang,
Abstract summary: We propose UDA, a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%.
Score: 23.497453639857852
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.
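The two quantities the abstract builds on, the Elo update with a K-factor and the dispersion of ratings across judges, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard Elo formula with a fixed K-factor (UDA instead learns the K-factor and refines win probabilities with a small neural network, which is omitted here), and the function names, the initial rating of 1000, and the choice of population standard deviation as the dispersion measure are all assumptions made for illustration.

```python
from statistics import pstdev


def expected_score(r_a, r_b):
    """Standard Elo win probability of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is a fixed K-factor here (assumption); UDA sets it adaptively.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_judge(comparisons, models, k=32.0, init=1000.0):
    """Replay one judge's verdicts, given as (model_a, model_b, score_a)."""
    ratings = {m: init for m in models}
    for a, b, score_a in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return ratings


def inter_judge_dispersion(ratings_by_judge, models):
    """Mean over models of the std. dev. of that model's rating across judges.

    Illustrative stand-in for the dispersion objective named in the abstract.
    """
    per_model = [
        pstdev([ratings[m] for ratings in ratings_by_judge.values()])
        for m in models
    ]
    return sum(per_model) / len(per_model)
```

For example, if one hypothetical judge prefers model A over model B and a second judge prefers B over A in a single comparison, `run_judge` gives each judge a different rating table, and `inter_judge_dispersion` over the two tables is nonzero. Judges that agree on every verdict yield a dispersion of zero.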

Related papers

K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge [51.93484138861584]
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. We propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs.
arXiv Detail & Related papers (2026-02-10T05:07:46Z)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge [10.584937371987742]
Existing LLM-as-a-Judge systems suffer from limited adaptivity to task- and domain-specific evaluation criteria. We propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge.
arXiv Detail & Related papers (2026-02-06T11:35:32Z)
Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning [91.8584139564909]
Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases. We propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers.
arXiv Detail & Related papers (2026-02-02T01:43:48Z)
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems [32.83708359216193]
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems. This paper systematically investigates judgment biases in two LLM-as-a-judge models under the point-wise scoring setting. We propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
arXiv Detail & Related papers (2025-10-14T12:52:29Z)
Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation [40.06592175227558]
This paper investigates a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts. We find that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications.
arXiv Detail & Related papers (2025-09-15T19:20:21Z)
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes limitations via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases. In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
Fairness in Ranking under Disparate Uncertainty [24.401219403555814]
We argue that ranking can introduce unfairness if the uncertainty of the underlying relevance model differs between groups of options. We propose Equal-Opportunity Ranking (EOR) as a new fairness criterion for ranking. We show that EOR corresponds to a group-wise fair lottery among the relevant options even in the presence of disparate uncertainty.
arXiv Detail & Related papers (2023-09-04T13:49:48Z)
Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
We propose a model agnostic post-processing framework xOrder for achieving fairness in bipartite ranking. xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics. We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
arXiv Detail & Related papers (2023-07-27T07:42:44Z)
Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other. We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases. A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network. For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
Debiasing Neural Retrieval via In-batch Balancing Regularization [25.941718123899356]
We develop a differentiable normed Pairwise Ranking Fairness (nPRF) and leverage the T-statistics on top of nPRF to improve fairness. Our method with nPRF achieves significantly less bias with minimal degradation in ranking performance compared with the baseline.
arXiv Detail & Related papers (2022-05-18T22:57:15Z)
Unbiased Pairwise Learning to Rank in Recommender Systems [4.058828240864671]
Unbiased learning to rank (LTR) algorithms are appealing candidates and have already been applied in many applications with single categorical labels. We propose a novel unbiased LTR algorithm to tackle the challenges, which innovatively models position bias in a pairwise fashion. Experimental results on public benchmark datasets and internal live traffic show the superior results of the proposed method for both categorical and continuous labels.
arXiv Detail & Related papers (2021-11-25T06:04:59Z)
Fairness-aware Class Imbalanced Learning [57.45784950421179]
We evaluate long-tail learning methods for tweet sentiment and occupation classification. We extend a margin-loss based approach with methods to enforce fairness.
arXiv Detail & Related papers (2021-09-21T22:16:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.