How to Correctly Report LLM-as-a-Judge Evaluations
- URL: http://arxiv.org/abs/2511.21140v1
- Date: Wed, 26 Nov 2025 07:46:46 GMT
- Title: How to Correctly Report LLM-as-a-Judge Evaluations
- Authors: Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee
- Abstract summary: Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to the imperfect specificity and sensitivity of LLMs. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both the test and calibration datasets.
- Score: 13.389479132464778
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to the imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. In practice, however, only estimates of these values are available, and it is not well understood how to construct valid confidence intervals from such estimates. This work presents a simple plug-in framework that corrects this bias and constructs confidence intervals reflecting uncertainty from both the test and calibration datasets, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.
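For intuition, the kind of plug-in correction the abstract describes can be sketched with the classical Rogan-Gladen estimator from diagnostic testing, together with a delta-method confidence interval that propagates uncertainty from both the test set and the calibration set. This is a minimal illustration under standard assumptions (binary correct/incorrect judgments, an informative judge with sensitivity + specificity > 1); the paper's exact estimator and interval construction may differ.

```python
import numpy as np
from scipy import stats

def corrected_accuracy(p_hat, sens_hat, spec_hat):
    """Rogan-Gladen-style plug-in correction: recover true accuracy from the
    judge's observed pass rate, given estimated judge sensitivity (TPR on
    correct answers) and specificity (TNR on incorrect answers)."""
    # E[observed pass rate] = sens * acc + (1 - spec) * (1 - acc)
    # => acc = (p_hat + spec - 1) / (sens + spec - 1)
    return (p_hat + spec_hat - 1.0) / (sens_hat + spec_hat - 1.0)

def plug_in_ci(n_pass, n_test, tp, n_pos_cal, tn, n_neg_cal, alpha=0.05):
    """Delta-method CI that reflects uncertainty from both the test set
    (p_hat) and the calibration set (sens_hat, spec_hat)."""
    p = n_pass / n_test
    se = tp / n_pos_cal           # estimated sensitivity
    sp = tn / n_neg_cal           # estimated specificity
    theta = corrected_accuracy(p, se, sp)
    d = se + sp - 1.0             # must be > 0 for an informative judge
    # Partial derivatives of theta w.r.t. (p, se, sp)
    g_p, g_se, g_sp = 1.0 / d, -theta / d, (1.0 - theta) / d
    var = (g_p**2 * p * (1 - p) / n_test
           + g_se**2 * se * (1 - se) / n_pos_cal
           + g_sp**2 * sp * (1 - sp) / n_neg_cal)
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta, (max(0.0, theta - half), min(1.0, theta + half))

# Example: judge passes 700/1000 test answers; the calibration set gives
# sensitivity 180/200 and specificity 170/200.
theta, ci = plug_in_ci(700, 1000, 180, 200, 170, 200)
print(f"corrected accuracy = {theta:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The variance term also makes explicit why adaptive allocation of calibration samples matters: the interval width depends on how the calibration budget is split between correct (positive) and incorrect (negative) reference answers.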
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers.
We introduce capability calibration, which targets the model's expected accuracy on a query.
Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
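As background for the pass@$k$ objective mentioned above, the standard unbiased estimator of Chen et al. (2021) computes pass@$k$ from $n$ sampled generations of which $c$ are correct; a calibrated confidence predictor tries to anticipate this quantity without drawing all $n$ samples. The sketch below shows only that baseline formula, not the related paper's method.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations (c of which are
    correct) is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```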
arXiv Detail & Related papers (2026-02-14T01:07:45Z)
- Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation [25.80946316489521]
We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges' hidden states.
We evaluate our approach on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments.
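A minimal sketch of what such a probe could look like, assuming PyTorch, random stand-in hidden states, and a 4096-dimensional feature size (all illustrative choices, not the paper's actual setup):

```python
import torch
import torch.nn as nn

# Stand-in features: in practice these would be hidden states extracted
# from the judge model, paired with labels for whether its judgment was correct.
hidden = torch.randn(512, 4096)                 # (n_examples, hidden_dim)
labels = torch.randint(0, 2, (512,)).float()    # 1 = judgment was correct

probe = nn.Linear(4096, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    p = torch.sigmoid(probe(hidden)).squeeze(-1)  # predicted confidence in [0, 1]
    loss = ((p - labels) ** 2).mean()             # Brier score as training loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```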
arXiv Detail & Related papers (2025-12-23T22:08:46Z)
- Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns [67.24756301536617]
We propose a Generalized Correctness Model (GCM) to generate accurate and calibrated confidence estimates.
We first show that GCMs can be trained on the correctness data from many LLMs.
We then use CMs as a lens for studying the source of correctness prediction ability and its generalization.
arXiv Detail & Related papers (2025-09-29T16:19:01Z)
- Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation [1.8861801513235323]
We introduce a sampling-free approach for estimating well-calibrated confidence values for classification tasks.
Our approach maintains well-calibrated confidence values while achieving increased processing speed.
Our method produces underconfident rather than overconfident predictions, an advantage for safety-critical applications.
arXiv Detail & Related papers (2024-11-18T15:13:20Z)
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
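As a rough, simplified stand-in for this idea, the self-consistency-style baseline below scores confidence by the agreement among answers extracted from several sampled explanations; the related paper's actual measure over explanation distributions is more sophisticated, so treat this only as the flavor of the approach.

```python
from collections import Counter

def explanation_agreement_confidence(answers):
    """Sample several explanations, extract the answer each supports, and
    use the empirical agreement rate of the modal answer as confidence."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

# e.g. answers extracted from 5 sampled explanations
print(explanation_agreement_confidence(["B", "B", "A", "B", "B"]))  # ('B', 0.8)
```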
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Calibrating Long-form Generations from Large Language Models [34.72041258464477]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Better Uncertainty Calibration via Proper Scores for Classification and Beyond [15.981380319863527]
We introduce the framework of proper calibration errors, which relates every calibration error to a proper score.
This relationship can be used to reliably quantify improvements in model calibration.
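For reference, the common binned expected calibration error (ECE) baseline that such work critiques and improves upon can be computed as below; this is the standard estimator, not the proper-calibration-error framework from the related paper.

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """Standard binned ECE: bin predictions by confidence, then take the
    bin-mass-weighted average gap |mean confidence - empirical accuracy|."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # right=True makes bins (lo, hi]; values <= edges[1] land in bin 0
    bin_ids = np.digitize(confs, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(confs[mask].mean() - correct[mask].mean())
    return ece

# toy example: somewhat overconfident predictions
confs = np.array([0.9, 0.8, 0.95, 0.7, 0.6])
correct = np.array([1, 0, 1, 0, 1])
print(f"ECE = {expected_calibration_error(confs, correct):.3f}")
```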
arXiv Detail & Related papers (2022-03-15T12:46:08Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood-based model selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)