Related papers: Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge

URL: http://arxiv.org/abs/2505.15240v1
Date: Wed, 21 May 2025 08:16:18 GMT
Title: Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge
Authors: Yassir Fathullah, Mark J. F. Gales
Abstract summary: We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. We propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations.
Score: 37.84914870036184
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve the efficiency of systems, reducing the number of comparisons needed by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.

Related papers

Always Tell Me The Odds: Fine-grained Conditional Probability Estimation [37.950889606305836]
We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. We show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.
arXiv Detail & Related papers (2025-05-02T21:33:18Z)
Enhancing accuracy of uncertainty estimation in appearance-based gaze tracking with probabilistic evaluation and calibration [13.564919425738163]
Uncertainty in appearance-based gaze tracking is critical for ensuring reliable downstream applications. Current uncertainty-aware approaches adopt probabilistic models to acquire uncertainties by following distributions in the training dataset. We propose a correction strategy based on probability calibration to mitigate biases in the estimated uncertainties of the trained models.
arXiv Detail & Related papers (2025-01-24T19:33:55Z)
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study [14.041071717005362]
Modern deep models for summarization attain impressive benchmark performance, but they are prone to generating miscalibrated predictive uncertainty. This means that they assign high confidence to low-quality predictions, leading to compromised reliability and trustworthiness in real-world applications. Probabilistic deep learning methods are common solutions to the miscalibration problem, but their relative effectiveness in complex autoregressive summarization tasks is not well understood.
arXiv Detail & Related papers (2023-04-17T23:06:28Z)
Uncertainty-Driven Action Quality Assessment [11.958132175629368]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores. We generate the estimation of uncertainty for each prediction, which is employed to re-weight AQA regression loss. Our proposed method achieves competitive results on three benchmarks including the Olympic events MTL-AQA and FineDiving, and the surgical skill JIGSAWS datasets.
arXiv Detail & Related papers (2022-07-29T07:21:15Z)
DEUP: Direct Epistemic Uncertainty Prediction [56.087230230128185]
Epistemic uncertainty is part of out-of-sample prediction error due to the lack of knowledge of the learner. We propose a principled approach for directly estimating epistemic uncertainty by learning to predict generalization error and subtracting an estimate of aleatoric uncertainty.
arXiv Detail & Related papers (2021-02-16T23:50:35Z)
Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance. We provide tractable algorithms to compute the range of attainable group-level predictive disparities. We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
Uncertainty-Aware Few-Shot Image Classification [118.72423376789062]
Few-shot image classification learns to recognize new categories from limited labelled data. We propose an Uncertainty-Aware Few-Shot framework for image classification.
arXiv Detail & Related papers (2020-10-09T12:26:27Z)
Efficient Ensemble Model Generation for Uncertainty Estimation with Bayesian Approximation in Segmentation [74.06904875527556]
We propose a generic and efficient segmentation framework to construct ensemble segmentation models. In the proposed method, ensemble models can be efficiently generated by using the layer selection method. We also devise a new pixel-wise uncertainty loss, which improves the predictive performance.
arXiv Detail & Related papers (2020-05-21T16:08:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.