Efficient Bayesian Inference from Noisy Pairwise Comparisons
- URL: http://arxiv.org/abs/2510.09333v1
- Date: Fri, 10 Oct 2025 12:37:20 GMT
- Title: Efficient Bayesian Inference from Noisy Pairwise Comparisons
- Authors: Till Aczel, Lucas Theis, Roger Wattenhofer
- Abstract summary: Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings.
- Score: 7.5294643377975765
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
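The abstract describes the core mechanism: a Bradley-Terry model extended with per-rater quality, fit by EM so that unreliable raters are downweighted. The paper's actual BBQ algorithm is not reproduced here; the sketch below is a minimal, hypothetical illustration of that general idea, in which each rater has a reliability `q[r]` (probability of reporting the true preference), the E-step computes the posterior that each reported outcome is correct, and the M-step updates reliabilities and item scores. All function names and the exact update rules are illustrative assumptions, not the authors' method.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def em_bradley_terry(comparisons, n_items, n_raters, n_iters=50, lr=0.1):
    """Fit a Bradley-Terry model with per-rater reliability by EM (sketch).

    comparisons: list of (winner, loser, rater) index tuples.
    Returns (scores, reliabilities).
    """
    s = [0.0] * n_items   # latent item scores
    q = [0.9] * n_raters  # rater reliability: P(reported winner is the true winner)
    for _ in range(n_iters):
        # E-step: posterior that each reported outcome reflects the true preference
        gamma = []
        for w, l, r in comparisons:
            p = sigmoid(s[w] - s[l])          # model prob. of the reported outcome
            num = q[r] * p
            gamma.append(num / (num + (1 - q[r]) * (1 - p)))
        # M-step (part 1): rater reliability = mean posterior over that rater's votes
        sums = [0.0] * n_raters
        counts = [0] * n_raters
        for (w, l, r), g in zip(comparisons, gamma):
            sums[r] += g
            counts[r] += 1
        for r in range(n_raters):
            if counts[r]:
                q[r] = sums[r] / counts[r]
        # M-step (part 2): gradient step on the responsibility-weighted log-likelihood
        grad = [0.0] * n_items
        for (w, l, r), g in zip(comparisons, gamma):
            p = sigmoid(s[w] - s[l])
            # weight g favors the reported winner; (1 - g) favors the reverse
            step = g * (1 - p) - (1 - g) * p
            grad[w] += step
            grad[l] -= step
        for i in range(n_items):
            s[i] += lr * grad[i]
        # anchor scores to zero mean for identifiability
        mean = sum(s) / n_items
        s = [x - mean for x in s]
    return s, q
```

On a toy dataset where rater 0 consistently reports item 0 beating item 1 and a smaller adversarial rater 1 reports the opposite, the fitted scores rank item 0 first and the adversarial rater's reliability is driven down, which is the downweighting behavior the abstract describes.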
Related papers
- K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge [51.93484138861584]
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. We propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs.
arXiv Detail & Related papers (2026-02-10T05:07:46Z) - A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth [4.9467757325435775]
Evaluating large language models (LLMs) on open-ended tasks is increasingly done via the LLM-as-a-judge paradigm. Treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters.
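A judge-specific discrimination parameter is commonly realized as a per-judge scaling of the score gap in the Bradley-Terry-Luce likelihood. The one-liner below is a hypothetical illustration of that form, not the paper's implementation; the parameter name `a_r` is an assumption.

```python
import math

def judge_aware_prob(s_i, s_j, a_r):
    """P(item i beats item j under judge r) in a BTL-style model where
    judge r has discrimination a_r: large a_r sharpens the judge's
    decisions, while a_r near 0 makes them close to a coin flip."""
    return 1.0 / (1.0 + math.exp(-a_r * (s_i - s_j)))
```

With `a_r = 0` every comparison is 50/50 regardless of the score gap, so an uninformative judge contributes little to the ranking, which matches the motivation of weighting judges unequally.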
arXiv Detail & Related papers (2026-01-29T15:01:28Z) - How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation [0.38991526486631006]
We show that when the preference signal is diffuse across prompts, proportional allocation is minimax-optimal. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence.
arXiv Detail & Related papers (2026-01-14T02:34:58Z) - UniCBE: An Uniformity-driven Comparing Based Evaluation Framework with Unified Multi-Objective Optimization [19.673388630963807]
We propose UniCBE, a unified uniformity-driven CBE framework. On the AlpacaEval benchmark, UniCBE saves over 17% of evaluation budgets while achieving a Pearson correlation with ground truth exceeding 0.995. In scenarios where new models are continuously introduced, UniCBE can even save over 50% of evaluation costs.
arXiv Detail & Related papers (2025-02-17T05:28:12Z) - Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z) - Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z) - Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models [7.770029179741429]
We propose P-precision and P-recall (PP&PR), based on a probabilistic approach that addresses these problems.
We show that our PP&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics.
arXiv Detail & Related papers (2023-09-04T13:19:17Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study [14.041071717005362]
Modern deep models for summarization attain impressive benchmark performance, but they are prone to generating miscalibrated predictive uncertainty.
This means that they assign high confidence to low-quality predictions, leading to compromised reliability and trustworthiness in real-world applications.
Probabilistic deep learning methods are common solutions to the miscalibration problem, but their relative effectiveness in complex autoregressive summarization tasks is not well understood.
arXiv Detail & Related papers (2023-04-17T23:06:28Z) - Restricted Bernoulli Matrix Factorization: Balancing the trade-off between prediction accuracy and coverage in classification based collaborative filtering [45.335821132209766]
We propose Restricted Bernoulli Matrix Factorization (ResBeMF) to enhance the performance of classification-based collaborative filtering.
The proposed model provides a good balance across the quality measures used, compared to other recommendation models.
arXiv Detail & Related papers (2022-10-05T13:48:19Z) - Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation [59.500347564280204]
We propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework.
AUR consists of a new uncertainty estimator along with a normal recommender model.
As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty.
arXiv Detail & Related papers (2022-09-22T04:32:51Z) - A Study on Mitigating Hard Boundaries of Decision-Tree-based Uncertainty Estimates for AI Models [0.0]
Uncertainty wrappers use a decision-tree approach to cluster input-quality-related uncertainties, assigning inputs strictly to distinct uncertainty clusters.
Our objective is to replace this with an approach that mitigates hard decision boundaries while preserving interpretability, runtime complexity, and prediction performance.
arXiv Detail & Related papers (2022-01-10T10:29:12Z) - Efficient Ensemble Model Generation for Uncertainty Estimation with Bayesian Approximation in Segmentation [74.06904875527556]
We propose a generic and efficient segmentation framework to construct ensemble segmentation models.
In the proposed method, ensemble models can be efficiently generated by using the layer selection method.
We also devise a new pixel-wise uncertainty loss, which improves the predictive performance.
arXiv Detail & Related papers (2020-05-21T16:08:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.