Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling
- URL: http://arxiv.org/abs/2510.03199v1
- Date: Fri, 03 Oct 2025 17:35:45 GMT
- Title: Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling
- Authors: Qiwei Di, Kaixuan Ji, Xuheng Li, Heyang Zhao, Quanquan Gu
- Abstract summary: LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-N (BoN). We propose Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. Unlike majority voting and BoN, BoM has a key advantage: its performance does not degrade when increasing $N$.
- Score: 54.50689440956967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-$N$ (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@$k$: the agent may submit up to $k$ responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@$k$ inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with $k$ and the sampling budget $N$. Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. We prove that when the sampling budget is $N=\tilde\Omega(C^*)$, the regret of BoM is $O(\epsilon_{\mathrm{opt}}+\sqrt{\epsilon_{\mathrm{RM}}^2C^*/k})$, where $C^*$ is the coverage coefficient, $\epsilon_{\mathrm{RM}}$ is the estimation error of the reward model, and $\epsilon_{\mathrm{opt}}$ is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing $N$. Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.
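The abstract's two-step selection rule can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the frequency threshold `min_freq` and the fallback when no candidate passes the filter are illustrative assumptions.

```python
from collections import Counter

def best_of_majority(responses, rewards, k, min_freq=2):
    """Hypothetical sketch of Best-of-Majority (BoM) selection.

    responses: list of N sampled answers (repeats allowed).
    rewards:   dict mapping each distinct answer to its reward-model score.
    Returns up to k answers to submit for Pass@k evaluation.
    """
    counts = Counter(responses)
    # Step 1 (majority filter): keep only high-frequency candidates,
    # which screens out rare responses with spuriously high rewards.
    frequent = [r for r, c in counts.items() if c >= min_freq]
    if not frequent:
        # Illustrative fallback: if nothing passes, keep all candidates.
        frequent = list(counts)
    # Step 2 (BoN-style step): among survivors, submit the k answers
    # with the highest estimated reward.
    return sorted(frequent, key=lambda r: rewards[r], reverse=True)[:k]

# Toy example: "c" has the top reward but appears only once, so the
# majority filter removes it; BoM submits the frequent answer "b".
picked = best_of_majority(
    ["a", "a", "a", "b", "b", "c"],
    {"a": 0.1, "b": 0.9, "c": 1.0},
    k=1,
)
print(picked)  # → ['b']
```

Plain Best-of-$N$ on the same toy data would return `['c']`, showing how the frequency filter guards against reward-model estimation error on rarely sampled responses.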
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
arXiv Detail & Related papers (2025-07-15T20:19:32Z) - Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds [20.024434010891433]
We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from standard Bayesian optimization in that the learner observes only preference feedback between candidate actions rather than scalar rewards. The objective is to identify the best action using a limited number of preference queries. In other words, the same number of preference samples as scalar-valued samples is sufficient to find a nearly optimal solution.
arXiv Detail & Related papers (2025-05-29T17:17:29Z) - Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two algorithms that enjoy provable scaling laws for the test-time compute of large language models. One is a two-stage knockout-style algorithm, where candidates are compared in pairs and losers are eliminated round by round. The other is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents.
arXiv Detail & Related papers (2024-11-29T05:29:47Z) - BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [16.38043428743923]
This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling.
We show that best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model.
Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy.
arXiv Detail & Related papers (2024-06-02T18:42:57Z) - Nearly Minimax Optimal Regret for Multinomial Logistic Bandit [8.087699764574788]
We study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size. We propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$ under uniform rewards.
arXiv Detail & Related papers (2024-05-16T06:07:31Z) - (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum [7.095058159492494]
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical gains over stochastic gradient descent. We show that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$.
arXiv Detail & Related papers (2024-01-12T18:17:28Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-minimax-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Regret Minimization with Noisy Observations [18.24975895643299]
In a typical optimization problem, the task is to pick one of a number of options with the lowest cost or the highest value.
In this paper, we study such scenarios using a minimization regret model.
We show that the naïve algorithm of picking the highest observed value has regret arbitrarily worse than the optimum.
arXiv Detail & Related papers (2022-07-19T17:48:54Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Maillard Sampling: Boltzmann Exploration Done Optimally [11.282341369957216]
The PhD thesis of Maillard (2013) presents a randomized algorithm for the $K$-armed bandit problem.
Maillard sampling (MS) computes the probability of choosing each arm in a closed form.
We propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log K}$ without losing optimality.
arXiv Detail & Related papers (2021-11-05T06:50:22Z) - Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound of order $\mathcal{O}(N^{-\sqrt{N}})$.
arXiv Detail & Related papers (2021-05-10T13:07:44Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.