Related papers: Majority of the Bests: Improving Best-of-N via Bootstrapping

URL: http://arxiv.org/abs/2511.18630v1
Date: Sun, 23 Nov 2025 22:05:08 GMT
Title: Majority of the Bests: Improving Best-of-N via Bootstrapping
Authors: Amin Rakhsha, Kanika Madan, Tianyu Zhang, Amir-massoud Farahmand, Amir Khasahmadi
Abstract summary: Majority-of-the-Bests (MoB) is a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
Score: 14.223905735887143
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
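
The selection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the subsample size `subsample_size`, the number of bootstrap rounds `num_bootstrap`, and the toy answers and rewards are all assumptions made here. The sketch resamples the candidate pool with replacement, runs plain Best-of-N on each resample, and returns the most frequent winner.

```python
import random
from collections import Counter


def best_of_n(answers, rewards):
    """Plain Best-of-N: return the final answer with the highest reward."""
    best_index = max(range(len(answers)), key=lambda i: rewards[i])
    return answers[best_index]


def majority_of_the_bests(answers, rewards, subsample_size,
                          num_bootstrap=1000, seed=0):
    """Bootstrap estimate of the BoN output distribution, then take its mode.

    subsample_size and num_bootstrap are hypothetical parameter names
    chosen for this sketch; the abstract does not specify them.
    """
    rng = random.Random(seed)
    n = len(answers)
    winner_counts = Counter()
    for _ in range(num_bootstrap):
        # Draw a bootstrap subsample of candidate indices (with replacement).
        indices = [rng.randrange(n) for _ in range(subsample_size)]
        # Run Best-of-N inside the subsample and record the winning answer.
        winner = max(indices, key=lambda i: rewards[i])
        winner_counts[answers[winner]] += 1
    # The mode of the empirical BoN output distribution.
    return winner_counts.most_common(1)[0][0]


# Toy data (made up): six samples agree on "42", while one sample with a
# different answer receives an inflated score from an imperfect reward model.
answers = ["42", "42", "42", "42", "42", "42", "7"]
rewards = [0.60, 0.62, 0.58, 0.61, 0.59, 0.60, 0.90]

bon_choice = best_of_n(answers, rewards)
mob_choice = majority_of_the_bests(answers, rewards, subsample_size=2)
print(bon_choice, mob_choice)
```

On this toy input, plain BoN follows the single inflated reward, while the bootstrap mode favours the answer that wins most resampled BoN rounds. The result is sensitive to the subsample size: with subsamples as large as the full pool, the outlier appears in most resamples and wins most rounds, so the small value used here is an illustrative choice rather than a recommendation.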

Related papers

Learning Generative Selection for Best-of-N [52.88943295436412]
We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. Our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models.
arXiv Detail & Related papers (2026-02-02T14:21:15Z)
Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling [54.50689440956967]
LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of-N (BoN). We propose Best-of-Majority (BoM) with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. Unlike majority voting and BoN, BoM has a key advantage: its performance does not degrade when increasing $N$.
arXiv Detail & Related papers (2025-10-03T17:35:45Z)
Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis [23.76662251965668]
Best-of-$N$ (BoN) is a method for inference-time alignment of generative models. We study BoN through a smooth version known as Soft Best-of-N (SBoN). Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization.
arXiv Detail & Related papers (2025-07-08T11:59:48Z)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [75.1351701045874]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs). We propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way of improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with regularization to the objective, and shows that it outperforms BoN sampling. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case RBoN proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z)
BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z)
Variational Best-of-N Alignment [57.617866305771756]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. We propose to fine-tune the language model to mimic what BoN does during inference. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN).
arXiv Detail & Related papers (2024-07-08T15:59:44Z)
Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment [7.349727826230864]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
arXiv Detail & Related papers (2024-04-01T11:26:50Z)
Distributionally Robust Bayesian Quadrature Optimization [60.383252534861136]
We study BQO under distributional uncertainty in which the underlying probability distribution is unknown except for a limited set of its i.i.d. samples. A standard BQO approach maximizes the Monte Carlo estimate of the true expected objective given the fixed sample set. We propose a novel posterior sampling based algorithm, namely distributionally robust BQO (DRBQO) for this purpose.
arXiv Detail & Related papers (2020-01-19T12:00:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.