Judging LLMs on a Simplex
- URL: http://arxiv.org/abs/2505.21972v1
- Date: Wed, 28 May 2025 04:50:41 GMT
- Title: Judging LLMs on a Simplex
- Authors: Patrick Vossler, Fan Xia, Yifan Mai, Jean Feng
- Abstract summary: A common practice is to use large language models (LLMs) themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable.
- Score: 2.088672652658465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated evaluation of free-form outputs from large language models (LLMs) is challenging because many distinct answers can be equally valid. A common practice is to use LLMs themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable using LLM judges. Our theoretical analysis uncovers a "phase transition" in ranking identifiability: for binary scoring systems, true rankings are identifiable even with weak judges under mild assumptions, while rankings become non-identifiable for three or more scoring levels even with infinite data, absent additional prior knowledge. This non-identifiability highlights how uncertainty in rankings stems from not only aleatoric uncertainty (i.e., inherent stochasticity in the data) but also epistemic uncertainty regarding which assumptions hold, an aspect that has received limited attention until now. To integrate both types of uncertainty, we use Bayesian inference to encode assumptions as priors and conduct sensitivity analysis of ranking estimates and credible intervals. Empirical evaluations across multiple benchmarks demonstrate that Bayesian inference yields more accurate rankings and substantially improves coverage rates. These results underscore the importance of taking a more holistic approach to uncertainty quantification when using LLMs as judges.
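The abstract describes representing both judges and candidates as points on a probability simplex and using Bayesian inference, with priors encoding assumptions, to obtain rankings with credible intervals. The sketch below is not the authors' implementation; it is a minimal illustration of that general idea under an assumed conjugate Dirichlet-multinomial model, with made-up judge-score counts and a symmetric prior standing in for the paper's assumption-encoding priors.

```python
"""Illustrative sketch (not the paper's code): Bayesian ranking of candidate
models from LLM-judge scores on a K-level scale, using a conjugate
Dirichlet-multinomial model. Each candidate's score distribution is a point
on the probability simplex; the Dirichlet prior encodes assumptions."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical judge-score counts: rows = candidate models, columns = counts
# of scores 1..K assigned by an LLM judge (assumed data, for illustration only).
counts = np.array([
    [10, 30, 60],   # candidate A
    [15, 35, 50],   # candidate B
    [20, 40, 40],   # candidate C
])
K = counts.shape[1]
score_values = np.arange(1, K + 1)   # numeric value attached to each scoring level
prior = np.ones(K)                   # symmetric Dirichlet prior (an assumption)

n_draws = 5000
ranks = np.zeros((n_draws, len(counts)), dtype=int)
for d in range(n_draws):
    # Posterior over each candidate's simplex point is Dirichlet(prior + counts).
    probs = np.array([rng.dirichlet(prior + c) for c in counts])
    quality = probs @ score_values                   # expected score under this draw
    ranks[d] = (-quality).argsort().argsort() + 1    # rank 1 = highest expected score

for i, name in enumerate(["A", "B", "C"]):
    lo, hi = np.percentile(ranks[:, i], [2.5, 97.5])
    print(f"candidate {name}: P(rank 1) = {np.mean(ranks[:, i] == 1):.2f}, "
          f"95% credible interval for rank = [{int(lo)}, {int(hi)}]")
```

Re-running the sampling under different priors (for example, priors that discount middle scoring levels) would be one way to probe the prior sensitivity the abstract refers to.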
Related papers
- Quantifying Query Fairness Under Unawareness [82.33181164973365]
We introduce a robust fairness estimator based on quantification that effectively handles multiple sensitive attributes beyond binary classifications. Our method outperforms existing baselines across various sensitive attributes and is the first to establish a reliable protocol for measuring fairness under unawareness.
arXiv Detail & Related papers (2025-06-04T16:31:44Z) - Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness. Observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z) - Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation [19.66750942418172]
Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases (a generic Borda-count sketch appears after this list).
arXiv Detail & Related papers (2025-03-29T04:36:25Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Confidence Diagram of Nonparametric Ranking for Uncertainty Assessment in Large Language Models Evaluation [20.022623972491733]
Ranking large language models (LLMs) has proven to be an effective tool to improve alignment based on the best-of-$N$ policy. We propose a new inferential framework for hypothesis testing on rankings of language models.
arXiv Detail & Related papers (2024-12-07T02:34:30Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - Evaluating language models as risk scores [23.779329697527054]
We introduce folktexts, a software package to generate risk scores using question-answering LLMs.
We evaluate 17 recent LLMs across five proposed benchmark tasks.
We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated.
arXiv Detail & Related papers (2024-07-19T18:13:37Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice (a minimal deferral sketch appears after this list).
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - Evaluating AI systems under uncertain ground truth: a case study in dermatology [43.8328264420381]
We show that ignoring uncertainty leads to overly optimistic estimates of model performance. In skin condition classification, we find that a large portion of the dataset exhibits significant ground truth uncertainty.
arXiv Detail & Related papers (2023-07-05T10:33:45Z)
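For the "Ethical AI on the Waitlist" entry above, which applies Borda scoring to LLM-produced rankings, the snippet below shows only the standard Borda-count rule with hypothetical candidates and rankings; it is not the paper's fairness analysis.

```python
"""Generic Borda count over several rankings (illustrative only)."""
from collections import defaultdict

def borda_scores(rankings):
    """Each ranking lists candidates from most to least preferred; a candidate
    at position i in a ranking of n candidates earns n - 1 - i points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)

# Hypothetical rankings produced by an LLM judge over three candidates.
rankings = [
    ["candidate_1", "candidate_2", "candidate_3"],
    ["candidate_2", "candidate_1", "candidate_3"],
    ["candidate_1", "candidate_3", "candidate_2"],
]
print(borda_scores(rankings))  # {'candidate_1': 5, 'candidate_2': 3, 'candidate_3': 1}
```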
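For the "When Does Confidence-Based Cascade Deferral Suffice?" entry above: the deferral rule invokes the next model in the cascade only when the current model's confidence is too low. A minimal two-stage sketch under that reading follows; the placeholder models and the threshold value are assumptions for illustration, not taken from the paper.

```python
"""Minimal confidence-based cascade deferral sketch (illustrative only)."""
from typing import Callable, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (label, confidence)

def cascade_predict(x: str, cheap: Model, expensive: Model,
                    threshold: float = 0.8) -> Tuple[str, bool]:
    """Return (prediction, deferred). Defer to the expensive model only when
    the cheap model's confidence falls below the threshold."""
    label, confidence = cheap(x)
    if confidence >= threshold:
        return label, False           # confident enough: terminate the cascade
    label, _ = expensive(x)           # otherwise invoke the next model
    return label, True

# Placeholder models; a real cascade would wrap actual classifiers or LLM calls.
cheap_model = lambda x: ("positive", 0.65)
expensive_model = lambda x: ("negative", 0.92)
print(cascade_predict("some input", cheap_model, expensive_model))  # ('negative', True)
```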
This list is automatically generated from the titles and abstracts of the papers on this site.