Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction
- URL: http://arxiv.org/abs/2509.18658v1
- Date: Tue, 23 Sep 2025 05:26:28 GMT
- Title: Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction
- Authors: Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, Jian Kang
- Abstract summary: This work presents the first framework to analyze the uncertainty by offering a prediction interval of LLM-based scoring via conformal prediction. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction intervals with coverage guarantees.
- Score: 13.958280616597385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze the uncertainty by offering a prediction interval of LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction intervals with coverage guarantees. We also explore the usefulness of the interval midpoint and judge reprompting for better judgment.
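As an illustration only (not the authors' released code), the interval construction described in the abstract can be sketched with standard split conformal prediction. The calibration data, the 1–5 rating grid, and the `ordinal_adjust` helper below are hypothetical simplifications of the paper's ordinal boundary adjustment and midpoint score:

```python
import numpy as np

def split_conformal_interval(cal_scores, cal_labels, test_score, alpha=0.1):
    """Split conformal interval for a single LLM-judge score.

    cal_scores : judge scores on a held-out calibration set.
    cal_labels : the corresponding human (reference) ratings.
    Under exchangeability, the returned interval covers the true
    rating with probability >= 1 - alpha.
    """
    residuals = np.abs(np.asarray(cal_labels, float) - np.asarray(cal_scores, float))
    n = len(residuals)
    # Finite-sample-corrected quantile of the nonconformity scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(residuals, level, method="higher")
    return test_score - q, test_score + q

def ordinal_adjust(lo, hi, grid=(1, 2, 3, 4, 5)):
    """Snap a continuous interval outward onto a discrete rating scale."""
    lo_adj = grid[0] if lo < grid[0] else max(g for g in grid if g <= lo)
    hi_adj = grid[-1] if hi > grid[-1] else min(g for g in grid if g >= hi)
    return lo_adj, hi_adj

# Hypothetical calibration data: noisy judge scores around 1-5 human ratings.
rng = np.random.default_rng(0)
cal_labels = rng.integers(1, 6, size=200)
cal_scores = np.clip(cal_labels + rng.normal(0, 0.5, size=200), 1, 5)

lo, hi = split_conformal_interval(cal_scores, cal_labels, test_score=3.4)
lo_d, hi_d = ordinal_adjust(lo, hi)
midpoint = (lo + hi) / 2  # midpoint-based point score inside the interval
```

Note that with absolute residuals the interval is symmetric, so the midpoint coincides with the raw score; an asymmetric nonconformity score (e.g., conformalized quantile regression) would make the midpoint a genuinely different, lower-bias estimate.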
Related papers
- Interval-Based AUC (iAUC): Extending ROC Analysis to Uncertainty-Aware Classification [12.024101882027466]
We propose an uncertainty-aware ROC framework specifically for interval-valued predictions. We introduce two new measures: $AUC_L$ and $AUC_U$. We prove that under valid class-conditional coverage, $AUC_L$ and $AUC_U$ provide formal lower and upper bounds on the theoretical optimal AUC.
arXiv Detail & Related papers (2026-02-04T17:12:04Z) - Conformal Prediction Algorithms for Time Series Forecasting: Methods and Benchmarking [0.0]
Temporal dependencies in time series violate the core assumption of data exchangeability. This paper critically examines the main categories of algorithmic solutions designed to address this conflict. We use AutoARIMA as the base forecaster on a large-scale monthly sales dataset.
arXiv Detail & Related papers (2026-01-26T14:15:08Z) - Localized Uncertainty Quantification in Random Forests via Proximities [1.0195618602298684]
In machine learning, uncertainty quantification helps assess the reliability of model predictions. Traditional approaches often emphasize predictive accuracy, but there is a growing focus on incorporating uncertainty measures. We propose a new approach using naturally occurring test sets and similarity measures (proximities) typically viewed as byproducts of random forests.
arXiv Detail & Related papers (2025-09-26T20:53:28Z) - Conformal Prediction Sets with Improved Conditional Coverage using Trust Scores [52.92618442300405]
It is impossible to achieve exact, distribution-free conditional coverage in finite samples. We propose an alternative conformal prediction algorithm that targets coverage where it matters most.
arXiv Detail & Related papers (2025-01-17T12:01:56Z) - Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z) - Bin-Conditional Conformal Prediction of Fatalities from Armed Conflict [0.5312303275762104]
We introduce bin-conditional conformal prediction (BCCP), which enhances standard conformal prediction by ensuring consistent coverage rates across user-defined subsets. Compared to standard conformal prediction, BCCP offers improved local coverage, though this comes at the cost of slightly wider prediction intervals.
arXiv Detail & Related papers (2024-10-18T14:41:42Z) - Score Matching-based Pseudolikelihood Estimation of Neural Marked
Spatio-Temporal Point Process with Uncertainty Quantification [59.81904428056924]
We introduce SMASH: a Score MAtching estimator for learning marked spatio-temporal point processes (STPPs) with uncertainty quantification.
Specifically, our framework adopts a normalization-free objective by estimating the pseudolikelihood of marked STPPs through score matching.
The superior performance of our proposed framework is demonstrated through extensive experiments in both event prediction and uncertainty quantification.
arXiv Detail & Related papers (2023-10-25T02:37:51Z) - Conformalizing Machine Translation Evaluation [9.89901717499058]
Several uncertainty estimation methods have been recently proposed for machine translation evaluation.
We show that the majority of them tend to underestimate model uncertainty, and as a result they often produce misleading confidence intervals that do not cover the ground truth.
We propose as an alternative the use of conformal prediction, a distribution-free method to obtain confidence intervals with a theoretically established guarantee on coverage.
arXiv Detail & Related papers (2023-06-09T19:36:18Z) - Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts [73.33395097728128]
We provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation.
We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions.
arXiv Detail & Related papers (2023-04-19T17:38:42Z) - Conformal Prediction Intervals for Remaining Useful Lifetime Estimation [5.171601921549565]
We investigate the conformal prediction (CP) framework that represents uncertainty by predicting sets of possible values for the target variable.
CP formally guarantees that the actual value (true RUL) is covered by the predicted set with a degree of certainty that can be prespecified.
We study three CP algorithms to conformalize any single-point RUL predictor and turn it into a valid interval predictor.
arXiv Detail & Related papers (2022-12-30T09:34:29Z) - Uncertainty estimation of pedestrian future trajectory using Bayesian approximation [137.00426219455116]
Under dynamic traffic scenarios, planning based on deterministic predictions is not trustworthy.
The authors propose to quantify, via Bayesian approximation, the forecasting uncertainty that deterministic approaches fail to capture.
The effect of dropout weights and long-term prediction on future state uncertainty has been studied.
arXiv Detail & Related papers (2022-05-04T04:23:38Z) - How to Evaluate Uncertainty Estimates in Machine Learning for Regression? [1.4610038284393165]
We show that both approaches to evaluating the quality of uncertainty estimates have serious flaws.
First, both approaches cannot disentangle the separate components that jointly create the predictive uncertainty.
Moreover, the current approach to testing prediction intervals directly has further flaws.
arXiv Detail & Related papers (2021-06-07T07:47:46Z) - Interpretable Machines: Constructing Valid Prediction Intervals with Random Forests [0.0]
An important issue when using Machine Learning algorithms in recent research is the lack of interpretability.
A contribution to this gap for the Random Forest Regression Learner is presented here.
Several parametric and non-parametric prediction intervals are provided for Random Forest point predictions.
A thorough investigation through Monte-Carlo simulation is conducted evaluating the performance of the proposed methods.
arXiv Detail & Related papers (2021-03-09T23:05:55Z) - AutoCP: Automated Pipelines for Accurate Prediction Intervals [84.16181066107984]
This paper proposes an AutoML framework called Automatic Machine Learning for Conformal Prediction (AutoCP).
Unlike the familiar AutoML frameworks that attempt to select the best prediction model, AutoCP constructs prediction intervals that achieve the user-specified target coverage rate.
We tested AutoCP on a variety of datasets and found that it significantly outperforms benchmark algorithms.
arXiv Detail & Related papers (2020-06-24T23:13:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.