Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
- URL: http://arxiv.org/abs/2603.05485v1
- Date: Thu, 05 Mar 2026 18:52:28 GMT
- Title: Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
- Authors: Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar,
- Abstract summary: An autonomous AI system will depend on automated, verifiable rewards and feedback. In settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. We propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias.
- Score: 11.22990902328416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.
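The abstract does not spell out the A-BB construction, but the (tau=0.5, delta=0.01) guarantee suggests a confidence-bound certificate over measured bias. Below is a minimal sketch under two assumptions of ours, not the authors': bias is measured as the judge's mean absolute score shift when the same answers are re-presented under a bias-inducing perturbation (e.g., reformatting), and the bound comes from a one-sided Hoeffding inequality.

```python
import math
from typing import Sequence

def hoeffding_upper_bound(samples: Sequence[float], delta: float,
                          value_range: float) -> float:
    """One-sided Hoeffding upper confidence bound on the mean of samples
    in [0, value_range]; holds with probability at least 1 - delta."""
    n = len(samples)
    mean = sum(samples) / n
    return mean + value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def certify_bias_bounded(orig_scores: Sequence[float],
                         perturbed_scores: Sequence[float],
                         tau: float = 0.5, delta: float = 0.01,
                         score_range: float = 1.0):
    """Certify that the judge's average bias (here: mean absolute score
    shift under a bias-inducing perturbation, our stand-in for the paper's
    'measurable bias') is at most tau, with confidence 1 - delta."""
    shifts = [abs(o - p) for o, p in zip(orig_scores, perturbed_scores)]
    bound = hoeffding_upper_bound(shifts, delta, score_range)
    return bound <= tau, bound
```

Under this reading, a judge passes only if the upper confidence bound on its average bias clears tau, so the certificate fails safe when the sample is small or the measured bias is noisy.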
Related papers
- Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations [3.262230127283452]
We show that evaluators may deliver self-preferring verdicts when judging queries that they themselves answered incorrectly. We introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model.
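The baseline reduces to comparing two conditional error rates. A minimal sketch, assuming verdict records with two hypothetical boolean fields (the paper's actual data schema is not given):

```python
def evaluator_quality_baseline(verdicts):
    """Compare P(judge wrongly votes for its own response) against
    P(judge votes for an incorrect response from another model).

    `verdicts`: iterable of dicts with two boolean keys (field names
    are hypothetical):
      'voted_for_self'   -- the judge picked the response it authored
      'picked_incorrect' -- the picked response is factually wrong
    """
    self_votes  = [v for v in verdicts if v["voted_for_self"]]
    other_votes = [v for v in verdicts if not v["voted_for_self"]]
    # max(..., 1) avoids division by zero; an empty stratum yields 0.0
    p_self  = sum(v["picked_incorrect"] for v in self_votes) / max(len(self_votes), 1)
    p_other = sum(v["picked_incorrect"] for v in other_votes) / max(len(other_votes), 1)
    return p_self, p_other  # p_self >> p_other suggests genuine self-preference
```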
arXiv Detail & Related papers (2026-01-30T04:38:18Z)
- Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models [55.94503936470247]
Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLM judges. Most classical methods assume annotators are conditionally independent given the true label $Y \in \{0,1\}$, an assumption often violated by LLM judges. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors.
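The abstract does not give the exact model, but a toy class-conditional Ising aggregator, with the parametrization and the brute-force normalization entirely our assumptions, could look like this:

```python
import itertools
import numpy as np

def ising_posterior(votes, h, J_pos, J_neg, prior_pos=0.5):
    """Posterior P(Y=+1 | votes) under a class-conditional Ising likelihood
        p(x | y) = exp(y * h.x + 0.5 * x.J_y.x) / Z(y),
    where the coupling matrix J_y models correlated judge errors. h, J_pos,
    and J_neg would be estimated from data in a real system; here the
    partition function is computed exactly by enumerating all 2^K vote
    patterns, so this only scales to a handful of judges."""
    x = np.asarray(votes, dtype=float)      # judge votes in {-1, +1}
    h = np.asarray(h, dtype=float)          # per-judge reliability fields
    J_pos = np.asarray(J_pos, dtype=float)  # couplings given Y = +1
    J_neg = np.asarray(J_neg, dtype=float)  # couplings given Y = -1
    K = len(x)

    def log_unnorm(xv, J, y):
        # log of the unnormalized Ising likelihood p(x | y)
        return y * (h @ xv) + 0.5 * (xv @ J @ xv)

    def log_Z(J, y):
        states = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=K)]
        return np.logaddexp.reduce([log_unnorm(s, J, y) for s in states])

    log_pos = log_unnorm(x, J_pos, +1) - log_Z(J_pos, +1) + np.log(prior_pos)
    log_neg = log_unnorm(x, J_neg, -1) - log_Z(J_neg, -1) + np.log(1.0 - prior_pos)
    return 1.0 / (1.0 + np.exp(log_neg - log_pos))
```

Setting both coupling matrices to zero recovers the classical conditionally independent (weighted-vote) aggregator, which is what makes the dependence terms the interesting part.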
arXiv Detail & Related papers (2026-01-29T21:26:50Z)
- Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems [32.83708359216193]
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems. This paper systematically investigates judgment biases in two LLM-as-a-judge models under the point-wise scoring setting. We propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
arXiv Detail & Related papers (2025-10-14T12:52:29Z)
- Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations [0.20027036140258694]
New Large Language Models (LLMs) become available every few weeks, and modern application developers confront the unenviable task of deciding whether to switch to a new model. We show that while LLMs can identify valid outputs with high accuracy, they are remarkably poor at identifying invalid ones. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent.
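A minimal sketch of a minority-veto rule consistent with this description; the threshold and the handling of missing verdicts are our assumptions, not the paper's tuned strategy:

```python
def minority_veto(verdicts, veto_fraction=0.2):
    """Aggregate per-judge validity verdicts so a dissenting minority can
    overrule an agreeable majority: because judges are much better at
    recognizing valid outputs than invalid ones, a sufficiently large
    minority of 'invalid' votes vetoes acceptance. Missing verdicts (None)
    are simply dropped, which keeps the rule robust to missing data."""
    cast = [v for v in verdicts if v is not None]
    if not cast:
        return None  # no evidence either way
    invalid = sum(1 for v in cast if v == "invalid")
    return "invalid" if invalid / len(cast) >= veto_fraction else "valid"
```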
arXiv Detail & Related papers (2025-10-13T18:19:23Z)
- Judging with Confidence: Calibrating Autoraters to Preference Distributions [56.17041629492863]
We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. We present two learning methods tailored to different data conditions. Our results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution.
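The abstract does not name the objective; one standard distribution-matching loss consistent with the description is soft-label cross-entropy against population preference rates, sketched here (not necessarily the paper's formulation):

```python
import torch

def distribution_matching_loss(pred_prob_a: torch.Tensor,
                               target_prob_a: torch.Tensor) -> torch.Tensor:
    """Soft-label cross-entropy between the autorater's predicted probability
    that response A is preferred and the empirical preference rate measured
    over the target population (rather than a hard majority-vote label)."""
    eps = 1e-6
    p = pred_prob_a.clamp(eps, 1 - eps)
    return -(target_prob_a * torch.log(p)
             + (1 - target_prob_a) * torch.log(1 - p)).mean()
```

Training on the empirical preference rate rather than the majority label lets the autorater express calibrated uncertainty on genuinely split prompts instead of collapsing to overconfident verdicts.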
arXiv Detail & Related papers (2025-09-30T20:36:41Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
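One simple internal signal of the kind the paper studies is the judge's token-level distribution over score tokens; a sketch (the specific latent signals used by the paper may differ):

```python
import math

def expected_likert_score(score_logprobs: dict) -> float:
    """Turn the judge's log-probabilities over Likert tokens ('1'..'5') at
    the rating position into a probability-weighted mean rating. Using the
    full distribution instead of the argmax token is one simple example of
    a scalar rating derived from internal model signals."""
    probs = {tok: math.exp(lp) for tok, lp in score_logprobs.items()}
    z = sum(probs.values())  # renormalize over the score tokens only
    return sum(int(tok) * p for tok, p in probs.items()) / z
```

For example, `expected_likert_score({"1": -4.1, "2": -2.0, "3": -0.7, "4": -1.2, "5": -3.3})` returns a fractional score near 3.2 rather than snapping to the argmax token "3".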
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
- When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity [21.192000569821943]
We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We show that the Elo-style aggregation used by Arena-Hard-Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware benchmarks.
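The uncertainty point is easy to illustrate generically: bootstrap the battle set and watch rank assignments spread. This is a generic Elo-plus-bootstrap illustration of ours, not Arena-Hard-Auto's actual aggregation pipeline:

```python
import random
from collections import defaultdict

def elo(battles, k=16, scale=400):
    """Single-pass Elo-style ratings from a list of (winner, loser) battles."""
    r = defaultdict(float)
    for w, l in battles:
        p_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / scale))
        r[w] += k * (1 - p_w)
        r[l] -= k * (1 - p_w)
    return r

def bootstrap_ranks(battles, n_boot=200, seed=0):
    """Resample battles with replacement and record each model's rank across
    resamples; wide rank spreads reveal the uncertainty that reporting a
    single point ranking hides."""
    rng = random.Random(seed)
    ranks = defaultdict(list)
    for _ in range(n_boot):
        sample = [rng.choice(battles) for _ in range(len(battles))]
        ordered = sorted(elo(sample).items(), key=lambda kv: -kv[1])
        for pos, (model, _) in enumerate(ordered, start=1):
            ranks[model].append(pos)
    return ranks
```

If two models' bootstrap rank distributions overlap heavily, a leaderboard that reports only their point ranks is manufacturing a confident ordering from noise.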
arXiv Detail & Related papers (2025-09-24T16:26:47Z)
- Meta-Fair: AI-Assisted Fairness Testing of Large Language Models [2.9632404823837777]
Fairness is a core principle in the development of Artificial Intelligence (AI) systems. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs.
arXiv Detail & Related papers (2025-07-03T11:20:59Z)
- Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models [66.51871176061195]
Decentralized Arena (dearena) is a fully automated framework leveraging collective intelligence from all large language models to evaluate each other. dearena attains up to 97% correlation with human judgements, while significantly reducing the cost.
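A toy version of such mutual evaluation, where the `answers` table and `judge_fn` callback are hypothetical placeholders for real LLM calls (dearena's actual procedure, including how it scales, is not described in the abstract):

```python
from itertools import combinations

def decentralized_ranking(models, answers, judge_fn):
    """Round-robin mutual evaluation: every model serves as a judge for
    every unordered pair of the *other* models' answers, and win counts
    induce a ranking. `answers[m][p]` holds model m's answer to prompt p;
    `judge_fn(judge, prompt, ans_a, ans_b)` returns 'A' or 'B'."""
    wins = {m: 0 for m in models}
    prompts = next(iter(answers.values())).keys()  # assumes shared prompt set
    for judge in models:
        for a, b in combinations([m for m in models if m != judge], 2):
            for p in prompts:
                verdict = judge_fn(judge, p, answers[a][p], answers[b][p])
                wins[a if verdict == "A" else b] += 1
    return sorted(models, key=lambda m: -wins[m])
```

Excluding each judge from the pairs it scores is one simple way to keep self-preference out of the aggregate; whether dearena does exactly this is our assumption.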
arXiv Detail & Related papers (2025-05-19T07:34:25Z)
- ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition [52.537021302246664]
Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance). We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51.
arXiv Detail & Related papers (2025-01-31T20:47:06Z)
- Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)