How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
- URL: http://arxiv.org/abs/2601.09084v2
- Date: Thu, 15 Jan 2026 03:47:46 GMT
- Title: How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
- Authors: Wilson Y. Lee
- Abstract summary: We show that when the preference signal is diffuse across prompts, proportional allocation is minimax-optimal. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence.
- Score: 0.38991526486631006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when the preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
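The abstract's central quantitative claim invites a quick power calculation. The sketch below is not taken from the paper: it applies the standard normal approximation for a two-sided binomial test of a pairwise win rate, and the function name `judgments_needed` and its parameters are illustrative assumptions.

```python
# How many pairwise judgments are needed to detect a preference margin
# delta, i.e. a true win rate of 0.5 + delta against the null of 0.5?
# Standard normal-approximation power analysis; not the paper's method.
from statistics import NormalDist

def judgments_needed(delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Sample size for a two-sided binomial test of H0: p = 0.5."""
    z = NormalDist().inv_cdf
    p = 0.5 + delta                             # alternative win rate
    z_alpha = z(1 - alpha / 2)                  # critical value under H0
    z_beta = z(power)                           # quantile for target power
    numerator = z_alpha * 0.5 + z_beta * (p * (1 - p)) ** 0.5
    return int((numerator / delta) ** 2) + 1

print(judgments_needed(0.05))  # ~780 judgments for a 5-point margin
print(judgments_needed(0.02))  # ~4900 judgments for a 2-point margin
```

Because the required sample size grows as $1/\delta^2$, halving the margin roughly quadruples the number of judgments, consistent with the abstract's observation that typical collection sizes are underpowered for small margins.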
Related papers
- K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge [51.93484138861584]
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. We propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs.
arXiv Detail & Related papers (2026-02-10T05:07:46Z) - Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains [67.71020482405343]
We study how to design algorithms that can leverage additional information in the form of a rating gap. We present new algorithms that achieve faster statistical rates than DPO in the presence of accurate rating-gap information.
arXiv Detail & Related papers (2026-01-31T08:38:21Z) - Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences [1.839031891198526]
The Law of Comparative Judgment posits that relative choices exhibit less cognitive burden and greater cognitive consistency than direct scoring. We develop a deep neural network regression model and a dual-branch pairwise comparison model. Human subject experiments reveal that comparative judgments require $60\%$ less annotation time per item.
arXiv Detail & Related papers (2026-01-30T23:13:06Z) - Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics [25.374192139098284]
We study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce ProtoBias, a controlled contrastive benchmark spanning Animals, Objects, and Demography images. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs. We propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking.
arXiv Detail & Related papers (2026-01-08T13:49:14Z) - Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
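For context on what an unbiased, variance-reduced off-policy estimator can look like, here is a sketch of the classical doubly robust form; the paper's estimator family is not described in this summary, so this is a standard technique under assumed variable names, not the proposed method.

```python
import numpy as np

def doubly_robust_value(rewards, weights, q_logged, q_target):
    """Doubly robust estimate of a target policy's value.

    rewards  : rewards observed under the logging policy
    weights  : importance ratios pi(a|x) / pi0(a|x) at the logged actions
    q_logged : reward-model predictions for the logged (context, action) pairs
    q_target : reward model's value of the target policy per context,
               i.e. sum over a of pi(a|x) * q_hat(x, a)

    The estimate stays unbiased for any reward model as long as the
    weights are exact; an accurate model shrinks the variance of the
    importance-weighted residual term.
    """
    rewards, weights, q_logged, q_target = map(
        np.asarray, (rewards, weights, q_logged, q_target)
    )
    return float(np.mean(q_target + weights * (rewards - q_logged)))
```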
arXiv Detail & Related papers (2025-06-12T13:11:01Z) - PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models [1.6816171955882597]
PMPO locates low-quality prompt segments via a masking-based analysis and iteratively rewrites them to propose improved variants. It selects among variants by minimizing loss in a single forward pass, eliminating output sampling and human- or judge-based scoring during selection. Across model sizes and datasets, PMPO outperforms prior prompt optimization methods: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and raises AlpacaEval 2.0 win rates by over 19 points.
arXiv Detail & Related papers (2025-05-22T06:59:10Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - Crowdsourcing subjective annotations using pairwise comparisons reduces bias and error compared to the majority-vote method [0.0]
We introduce a theoretical framework for understanding how random error and measurement bias enter into crowdsourced annotations of subjective constructs.
We then propose a pipeline that combines pairwise comparison labelling with Elo scoring, and demonstrate that it outperforms the ubiquitous majority-voting method in reducing both types of measurement error.
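A minimal sketch of the Elo half of such a pipeline appears below; the update rule, 400-point scale, and K-factor are conventional Elo defaults rather than the authors' exact settings, and the item names are placeholders.

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Update two ratings after one pairwise comparison (standard Elo)."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Fold a stream of pairwise annotations into per-item scores.
ratings = {"item_a": 1500.0, "item_b": 1500.0, "item_c": 1500.0}
comparisons = [("item_a", "item_b", True), ("item_b", "item_c", False)]
for a, b, a_wins in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
```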
arXiv Detail & Related papers (2023-05-31T17:14:12Z) - Few-shot Forgery Detection via Guided Adversarial Interpolation [56.59499187594308]
Existing forgery detection methods suffer from significant performance drops when applied to unseen novel forgery approaches.
We propose Guided Adversarial Interpolation (GAI) to overcome the few-shot forgery detection problem.
Our method proves robust to the choice of majority and minority forgery approaches.
arXiv Detail & Related papers (2022-04-12T16:05:10Z) - Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z) - Multi-label Contrastive Predictive Coding [125.03510235962095]
Variational mutual information (MI) estimators are widely used in unsupervised representation learning methods such as contrastive predictive coding (CPC).
We introduce a novel estimator based on a multi-label classification problem, where the critic needs to jointly identify multiple positive samples at the same time.
We show that, using the same number of negative samples, multi-label CPC is able to exceed the $\log m$ bound, while still being a valid lower bound of mutual information.
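For context, the $\log m$ ceiling on single-positive contrastive estimators follows from a standard InfoNCE argument, sketched here rather than reproduced from the paper:

```latex
% With one positive among m candidates and nonnegative scores f, the
% positive score also appears in the denominator average, so the
% log-ratio inside the expectation can never exceed log m.
\[
  I_{\mathrm{NCE}}
  = \mathbb{E}\!\left[\log \frac{f(x,y)}{\frac{1}{m}\sum_{i=1}^{m} f(x,y_i)}\right]
  \le \log m,
  \qquad
  \text{since } f(x,y) \le \sum_{i=1}^{m} f(x,y_i).
\]
```

Identifying multiple positives jointly changes the structure of the denominator, which is what lets the multi-label estimator exceed this bound.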
arXiv Detail & Related papers (2020-07-20T02:46:21Z) - Active Sampling for Pairwise Comparisons via Approximate Message Passing and Information Gain Maximization [5.771869590520189]
We propose ASAP, an active sampling algorithm based on approximate message passing and expected information gain.
We show that ASAP offers the highest accuracy of inferred scores among existing methods.
arXiv Detail & Related papers (2020-04-12T20:48:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences arising from its use.