Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models
- URL: http://arxiv.org/abs/2601.22336v1
- Date: Thu, 29 Jan 2026 21:26:50 GMT
- Title: Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models
- Authors: Krishnakumar Balasubramanian, Aleksandr Podkopaev, Shiva Prasad Kasiviswanathan,
- Abstract summary: Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods assume annotators are conditionally independent given the true label $Y \in \{0,1\}$, an assumption often violated by LLM judges. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors.
- Score: 55.94503936470247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.
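As a concrete illustration of the quadratic-versus-linear distinction described in the abstract, the sketch below computes the exact Bayes log-odds of $Y=1$ for $K$ judges whose $\pm 1$ votes follow class-conditional Ising models. This is not the authors' code: the field/coupling parametrization, the specific values, and the brute-force partition functions are illustrative assumptions, suitable only for small $K$. When the couplings differ across classes, the log-odds contains pairwise vote products (quadratic in the votes); when the couplings are shared, those terms cancel and the rule is linear in the votes.

```python
# Minimal sketch (assumed parametrization, not the paper's implementation):
# votes x in {-1,+1}^K follow a class-conditional Ising model
#   P(x | Y=y) ∝ exp( h_y · x + 0.5 * x^T J_y x ),  y in {0, 1},
# and the exact Bayes log-odds is computed by brute-force enumeration (small K only).
import itertools
import numpy as np

def log_partition(h, J):
    """log of the sum over all x in {-1,+1}^K of exp(h·x + 0.5 x^T J x)."""
    K = len(h)
    energies = [h @ np.array(x) + 0.5 * np.array(x) @ J @ np.array(x)
                for x in itertools.product([-1, 1], repeat=K)]
    return np.log(np.sum(np.exp(energies)))

def bayes_log_odds(votes, h0, J0, h1, J1, prior1=0.5):
    """log P(Y=1 | votes) - log P(Y=0 | votes) under the two class-conditional Ising models."""
    x = np.asarray(votes, dtype=float)
    ll1 = h1 @ x + 0.5 * x @ J1 @ x - log_partition(h1, J1)
    ll0 = h0 @ x + 0.5 * x @ J0 @ x - log_partition(h0, J0)
    return np.log(prior1 / (1.0 - prior1)) + ll1 - ll0

K = 3
h0, h1 = -0.4 * np.ones(K), 0.4 * np.ones(K)      # per-judge "fields" (accuracy), illustrative values
J = 0.3 * (np.ones((K, K)) - np.eye(K))           # positive couplings: correlated judges
votes = [1, 1, -1]

# Class-independent couplings (J0 == J1): the quadratic terms cancel and the
# log-odds is linear in the votes (the weighted-vote case from the abstract).
print(bayes_log_odds(votes, h0, J, h1, J))
# Class-dependent couplings (J0 != J1): the log-odds is genuinely quadratic in the votes.
print(bayes_log_odds(votes, h0, 0.0 * J, h1, J))
```

Setting all couplings to zero recovers a conditional-independence (weighted-majority-style) rule, so comparing its sign with the dependence-aware log-odds on the same votes illustrates how ignoring correlations among judges can flip the Bayes label.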
Related papers
- A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth [4.9467757325435775]
Evaluating large language models (LLMs) on open-ended tasks is increasingly done via the LLM-as-a-judge paradigm. Treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters.
arXiv Detail & Related papers (2026-01-29T15:01:28Z) - Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge [5.855996386998925]
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item.
arXiv Detail & Related papers (2025-12-02T18:46:47Z) - Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z) - Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Streaming algorithms for evaluating noisy judges on unlabeled data -- binary classification [0.0]
We search for nearly error-independent trios by using the algebraic failure modes to reject evaluation ensembles as too correlated.
The results produced by the surviving ensembles can sometimes be as good as 1%.
A Taylor expansion of the estimates produced when independence is assumed but the classifiers are, in fact, slightly correlated helps clarify how the independent evaluator has algebraic 'blind spots'.
arXiv Detail & Related papers (2023-06-02T17:52:59Z) - ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training [110.52785254565518]
Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy by forcing a new model to imitate the old models, or use ensembles.
We analyze the role of ensembles in reducing NFR and observe that they remove negative flips that are typically not close to the decision boundary.
We present a method, called Ensemble Logit Difference Inhibition (ELODI), to train a classification system that achieves paragon performance in both error rate and NFR; a minimal sketch of the NFR metric follows this list.
arXiv Detail & Related papers (2022-05-12T17:59:56Z)
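For the ELODI entry above, the negative flip rate is a simple metric: a negative flip is a test example that the old model classified correctly but the updated model gets wrong, and NFR is the fraction of such examples. The sketch below illustrates the metric only (with made-up predictions), not the ELODI training method itself.

```python
# Sketch of the negative flip rate (NFR); illustrates the metric, not ELODI.
import numpy as np

def negative_flip_rate(y_true, old_pred, new_pred):
    """Fraction of examples the old model got right but the new model gets wrong."""
    y_true, old_pred, new_pred = map(np.asarray, (y_true, old_pred, new_pred))
    negative_flips = (old_pred == y_true) & (new_pred != y_true)
    return negative_flips.mean()

y_true   = np.array([0, 1, 1, 0, 1])
old_pred = np.array([0, 1, 0, 0, 1])   # old model: 4/5 correct
new_pred = np.array([0, 1, 1, 1, 0])   # new model: 3/5 correct
print(negative_flip_rate(y_true, old_pred, new_pred))  # 2/5 = 0.4
```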