Best Arm Identification with LLM Judges and Limited Human
- URL: http://arxiv.org/abs/2601.21471v1
- Date: Thu, 29 Jan 2026 09:50:34 GMT
- Title: Best Arm Identification with LLM Judges and Limited Human
- Authors: Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi,
- Abstract summary: We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy is available for every sample. We develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals. Based on the estimator and confidence sequence, we propose an algorithm that adaptively selects and audits arms.
- Score: 18.85883540190321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy (e.g., an LLM judge) is available for every sample, while an expensive ground-truth label can be acquired only selectively through human auditing. Unlike classical multi-fidelity BAI, the proxy is biased (arm- and context-dependent) and ground truth is selectively observed. Consequently, standard multi-fidelity methods can mis-select the best arm, and uniform auditing, though accurate, wastes scarce resources and is inefficient. We prove that without bias correction and propensity adjustment, the mis-selection probability may not vanish, even with unlimited proxy data. We then develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals, and we form anytime-valid confidence sequences for that estimator. Based on the estimator and confidence sequences, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and close arms, and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
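The bias-correction idea in the abstract can be illustrated with a minimal sketch. The function below is not the paper's actual estimator; it is a generic doubly-robust-style construction under assumed names: the proxy mean is corrected by inverse-propensity-weighted residuals computed on the audited subsample, which removes the proxy's bias as long as the audit propensities are correct.

```python
import random

def dr_arm_mean(proxy, labels, audited, propensity):
    """Sketch of a proxy-plus-IPW-residual estimate of an arm's mean.

    proxy      : proxy (e.g., LLM-judge) score for every sample
    labels     : ground-truth label, used only where audited[i] is True
    audited    : whether sample i was audited by a human
    propensity : probability with which sample i was selected for audit
    """
    n = len(proxy)
    # Proxy mean plus inverse-propensity-weighted residual correction:
    # the correction cancels the proxy's bias in expectation, even when
    # the proxy is systematically off, provided propensities are correct.
    correction = sum(
        (labels[i] - proxy[i]) / propensity[i]
        for i in range(n)
        if audited[i]
    )
    return sum(proxy) / n + correction / n

# Hypothetical simulation: the proxy is biased upward by 0.2 and audits
# are uniform at rate 0.3 (both values chosen only for illustration).
random.seed(0)
n = 20000
truth = [random.random() for _ in range(n)]
proxy = [t + 0.2 for t in truth]
propensity = [0.3] * n
audited = [random.random() < 0.3 for _ in range(n)]

est = dr_arm_mean(proxy, truth, audited, propensity)
```

In this toy run the naive proxy mean stays roughly 0.2 above the true mean, while the corrected estimate concentrates around the truth, which is the failure mode of uncorrected multi-fidelity methods that the paper formalizes.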
Related papers
- Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges [14.256638949961063]
We introduce a "Noisy but Valid" hypothesis testing framework to address this. Our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty.
arXiv Detail & Related papers (2026-01-28T18:05:06Z) - Cost Efficient Fairness Audit Under Partial Feedback [14.57835291220813]
We study the problem of auditing the fairness of a given classifier under partial feedback. We introduce a novel cost model for acquiring additional labeled data. We show that our algorithms consistently outperform natural baselines by around 50% in terms of audit cost.
arXiv Detail & Related papers (2025-10-04T08:38:03Z) - Judging with Confidence: Calibrating Autoraters to Preference Distributions [56.17041629492863]
We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. We present two learning methods tailored to different data conditions. Our results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution.
arXiv Detail & Related papers (2025-09-30T20:36:41Z) - Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z) - SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models [0.27309692684728604]
Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner.
arXiv Detail & Related papers (2025-07-24T08:28:17Z) - COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
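The calibration step described above can be sketched generically. This is not COIN's actual construction; it uses a simple one-sided Hoeffding bound (an assumption, in place of whichever confidence interval method the paper employs) to turn an empirical error rate on a calibration set into a high-probability upper bound on the true error rate.

```python
import math

def error_rate_upper_bound(errors, delta):
    """One-sided Hoeffding upper confidence bound on the true error rate.

    errors : list of 0/1 error indicators from a calibration set
    delta  : failure probability; the bound holds with prob. >= 1 - delta
    """
    n = len(errors)
    empirical = sum(errors) / n
    # Hoeffding's inequality for bounded variables gives this deviation term.
    return empirical + math.sqrt(math.log(1 / delta) / (2 * n))

# Hypothetical calibration set: 40 errors in 1000 answers, delta = 0.05.
errors = [1] * 40 + [0] * 960
bound = error_rate_upper_bound(errors, 0.05)
```

A threshold that keeps `bound` below the target risk level then certifies, with probability at least 1 - delta, that the deployed error rate stays controlled.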
arXiv Detail & Related papers (2025-06-25T07:04:49Z) - A Principled Approach to Randomized Selection under Uncertainty: Applications to Peer Review and Grant Funding [61.86327960322782]
We propose a principled framework for randomized decision-making based on interval estimates of the quality of each item. We introduce MERIT, an optimization-based method that maximizes the worst-case expected number of top candidates selected. We prove that MERIT satisfies desirable axiomatic properties not guaranteed by existing approaches.
arXiv Detail & Related papers (2025-06-23T19:59:30Z) - Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget [55.938644481736446]
We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability. We validate our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity.
arXiv Detail & Related papers (2025-06-03T02:56:26Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Individually Fair Learning with One-Sided Feedback [15.713330010191092]
We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances.
On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner.
We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual semi-bandit problem.
arXiv Detail & Related papers (2022-06-09T12:59:03Z) - Trustworthy Preference Completion in Social Choice [36.91054060923998]
Since it is impractical to ask agents to provide linear orders over all alternatives, preference completion must be conducted for the resulting partial rankings.
A trust-based anchor-kNN algorithm is proposed to find $k$-nearest trustworthy neighbors of the agent with trust-oriented Kendall-Tau distances.
A certain common voting rule for the first $k$ trustworthy neighboring agents based on certainty and conflict can be taken to conduct the trustworthy preference completion.
arXiv Detail & Related papers (2020-12-14T03:03:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.