LLM-as-Judge on a Budget
- URL: http://arxiv.org/abs/2602.15481v1
- Date: Tue, 17 Feb 2026 10:35:41 GMT
- Title: LLM-as-Judge on a Budget
- Authors: Aadirupa Saha, Aniket Wagde, Branislav Kveton,
- Abstract summary: We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation.
- Score: 35.393598355979385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how to optimally allocate queries across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, $\sigma_i^2$ being the unknown score variance for pair $i \in [K]$, with near-optimal budget allocation. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
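The abstract's allocation idea, spending more of the budget $B$ on pairs whose scores are noisier, can be sketched in a few lines. The following is a minimal illustrative heuristic (greedy standard-error reduction against a synthetic Gaussian judge), not the authors' algorithm; the `score_fn` interface, the `warmup` phase, and the toy score distributions are all assumptions made for the example.

```python
import random
import statistics

def adaptive_allocate(score_fn, K, B, warmup=5):
    """Variance-adaptive query allocation (illustrative sketch, not the
    paper's exact algorithm). score_fn(i) returns one stochastic judge
    score for prompt-response pair i. After a uniform warm-up, each
    remaining query goes to the pair whose mean estimate currently has
    the largest standard error."""
    samples = {i: [score_fn(i) for _ in range(warmup)] for i in range(K)}

    def stderr(i):
        s = samples[i]
        return statistics.stdev(s) / len(s) ** 0.5  # sigma_hat_i / sqrt(n_i)

    for _ in range(B - warmup * K):
        target = max(range(K), key=stderr)
        samples[target].append(score_fn(target))

    means = {i: statistics.fmean(s) for i, s in samples.items()}
    counts = {i: len(s) for i, s in samples.items()}
    return means, counts

# Toy judge: Gaussian scores with per-pair noise (pair 2 is much noisier).
random.seed(0)
true_means = [0.2, 0.5, 0.8]
true_stds = [0.05, 0.10, 0.40]
judge = lambda i: random.gauss(true_means[i], true_stds[i])
est_means, counts = adaptive_allocate(judge, K=3, B=300)
```

Run on this toy instance, the noisiest pair absorbs most of the budget, which is the qualitative behavior the $\tilde{O}(\sqrt{\sum_i \sigma_i^2 / B})$ bound rewards: allocation roughly tracks the per-pair variances rather than splitting the budget uniformly.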
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Almost Asymptotically Optimal Active Clustering Through Pairwise Observations [59.20614082241528]
We propose a new analysis framework for clustering $M$ items into an unknown number of $K$ distinct groups using noisy and actively collected responses. We establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the accuracy of the clustering. We develop a computationally feasible variant of the Generalized Likelihood Ratio statistic and show that its performance gap to the lower bound can be accurately empirically estimated.
arXiv Detail & Related papers (2026-02-05T14:16:47Z) - Improving Value-based Process Verifier via Low-Cost Variance Reduction [24.609940184050043]
Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning.
arXiv Detail & Related papers (2025-08-14T11:22:29Z) - Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning [50.93804891554481]
We introduce a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. In the off-policy learning scenario, we establish bounds on the regret -- the performance gap between our LSE estimator and the optimal policy.
arXiv Detail & Related papers (2025-06-07T17:37:10Z) - Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget [55.938644481736446]
We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability. We validate our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity.
arXiv Detail & Related papers (2025-06-03T02:56:26Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP [1.0923877073891446]
We analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning.
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Efficient Algorithms for Empirical Group Distributional Robust Optimization and Beyond [15.664414751701718]
We formulate empirical GDRO as a $\textit{two-level}$ finite-sum convex-concave minimax optimization problem.
We compute the snapshot and mirror snapshot point by a one-index-shifted weighted average, which distinguishes us from the naive ergodic average.
Remarkably, our approach outperforms the state-of-the-art method by a factor of $\sqrt{m}$.
arXiv Detail & Related papers (2024-03-06T09:14:24Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively.
arXiv Detail & Related papers (2023-05-01T02:37:59Z) - When are Local Queries Useful for Robust Learning? [25.832511407411637]
We study learning models where the learner is given more power through the use of local queries.
We give the first distribution-free algorithms that perform robust empirical risk minimization.
We finish by giving robust learning algorithms for halfspaces on $\{0,1\}^n$ and then obtaining robustness guarantees for halfspaces in $\mathbb{R}^n$ against precision-bounded adversaries.
arXiv Detail & Related papers (2022-10-12T11:04:22Z) - Minimum mean-squared error estimation with bandit feedback [10.660855209170586]
We consider the problem of sequentially learning to estimate, in the mean squared error (MSE) sense. We propose two MSE estimators, and analyze their concentration properties.
arXiv Detail & Related papers (2022-03-31T05:33:32Z) - Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator which does not require $nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
arXiv Detail & Related papers (2020-10-24T10:44:02Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.