Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling
- URL: http://arxiv.org/abs/2510.03199v1
- Date: Fri, 03 Oct 2025 17:35:45 GMT
- Title: Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling
- Authors: Qiwei Di, Kaixuan Ji, Xuheng Li, Heyang Zhao, Quanquan Gu
- Abstract summary: LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-N (BoN). We propose Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. Unlike majority voting and BoN, BoM has a key advantage: its performance does not degrade when increasing $N$.
- Score: 54.50689440956967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-$N$ (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@$k$: the agent may submit up to $k$ responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@$k$ inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with $k$ and the sampling budget $N$. Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. We prove that when the sampling budget is $N=\tilde\Omega(C^*)$, the regret of BoM is $O(\epsilon_{\mathrm{opt}}+\sqrt{\epsilon_{\mathrm{RM}}^2C^*/k})$, where $C^*$ is the coverage coefficient, $\epsilon_{\mathrm{RM}}$ is the estimation error of the reward model, and $\epsilon_{\mathrm{opt}}$ is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing $N$. Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.
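The abstract's two-step selection rule can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the frequency threshold `min_freq` and the fallback when no candidate passes the filter are illustrative assumptions.

```python
from collections import Counter

def best_of_majority(responses, rewards, k, min_freq=2):
    """Hypothetical sketch of Best-of-Majority (BoM) selection.

    responses: list of N sampled answers (repeats allowed).
    rewards:   dict mapping each distinct answer to its reward-model score.
    Returns up to k answers to submit for Pass@k evaluation.
    """
    counts = Counter(responses)
    # Step 1 (majority filter): keep only high-frequency candidates,
    # which screens out rare responses with spuriously high rewards.
    frequent = [r for r, c in counts.items() if c >= min_freq]
    if not frequent:
        # Illustrative fallback: if nothing passes, keep all candidates.
        frequent = list(counts)
    # Step 2 (BoN-style step): among survivors, submit the k answers
    # with the highest estimated reward.
    return sorted(frequent, key=lambda r: rewards[r], reverse=True)[:k]

# Toy example: "c" has the top reward but appears only once, so the
# majority filter removes it; BoM submits the frequent answer "b".
picked = best_of_majority(
    ["a", "a", "a", "b", "b", "c"],
    {"a": 0.1, "b": 0.9, "c": 1.0},
    k=1,
)
print(picked)  # → ['b']
```

Plain Best-of-$N$ on the same toy data would return `['c']`, showing how the frequency filter guards against reward-model estimation error on rarely sampled responses.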
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
arXiv Detail & Related papers (2025-07-15T20:19:32Z) - Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds [20.024434010891433]
We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from standard Bayesian optimization in that the learner observes only preference feedback between candidate actions rather than scalar rewards. The objective is to identify the best action using a limited number of preference queries. In other words, the same number of preference samples as scalar-valued samples is sufficient to find a nearly optimal solution.
arXiv Detail & Related papers (2025-05-29T17:17:29Z) - Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two algorithms that enjoy provable scaling laws for the test-time compute of large language models. One is a two-stage knockout-style algorithm, where candidates are compared in pairs and losers are eliminated round by round. The other is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents.
arXiv Detail & Related papers (2024-11-29T05:29:47Z) - BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [16.38043428743923]
This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling.
We show that best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model.
Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy.
arXiv Detail & Related papers (2024-06-02T18:42:57Z) - Nearly Minimax Optimal Regret for Multinomial Logistic Bandit [8.087699764574788]
We study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size. We propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$ under uniform rewards.
arXiv Detail & Related papers (2024-05-16T06:07:31Z) - (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum [7.095058159492494]
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical gains over stochastic gradient descent. We show that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$.
arXiv Detail & Related papers (2024-01-12T18:17:28Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-minimax-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Regret Minimization with Noisy Observations [18.24975895643299]
In a typical optimization problem, the task is to pick one of a number of options with the lowest cost or the highest value.
In this paper, we study such scenarios using a minimization regret model.
We show that the naïve algorithm of picking the highest observed value has regret arbitrarily worse than the optimum.
arXiv Detail & Related papers (2022-07-19T17:48:54Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Maillard Sampling: Boltzmann Exploration Done Optimally [11.282341369957216]
The PhD thesis of Maillard (2013) presents a randomized algorithm for the $K$-armed bandit problem.
Maillard sampling (MS) computes the probability of choosing each arm in a closed form.
We propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log K}$ without losing optimality.
arXiv Detail & Related papers (2021-11-05T06:50:22Z) - Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound of order $\mathcal{O}(N^{-\sqrt{N}})$.
arXiv Detail & Related papers (2021-05-10T13:07:44Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.