Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis
- URL: http://arxiv.org/abs/2507.05913v1
- Date: Tue, 08 Jul 2025 11:59:48 GMT
- Title: Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis
- Authors: Gholamali Aminian, Idan Shenfeld, Amir R. Asadi, Ahmad Beirami, Youssef Mroueh,
- Abstract summary: Best-of-$N$ (BoN) is a method for inference-time alignment of generative models. We study BoN through a smooth version known as Soft Best-of-N (SBoN). Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization.
- Score: 23.76662251965668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated with a proxy reward model, and the highest-scoring one is selected. While prior work argues that BoN is nearly optimal in reward-vs-KL tradeoffs, its effectiveness depends critically on the quality of the proxy reward model used for selection. To address this gap, we study BoN through a smooth variant known as Soft Best-of-N (SBoN) and develop a theoretical framework for its analysis. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insight into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true reward under the optimal policy and under the SBoN policy. Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization, especially when the quality of the proxy reward is low.
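The selection rule described in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function name and the exact temperature parameterization of SBoN are assumptions. Here SBoN samples a candidate with probability proportional to a softmax of the proxy rewards, recovering hard BoN as the temperature goes to 0 and approaching the reference distribution as the temperature grows.

```python
import math
import random

def soft_best_of_n(candidates, reward_fn, temperature):
    """Select one of N candidates by softmax-weighted sampling over proxy
    rewards. temperature == 0 recovers hard Best-of-N (argmax over rewards);
    a large temperature approaches uniform sampling over the candidates."""
    rewards = [reward_fn(c) for c in candidates]
    if temperature == 0:
        # Hard BoN: pick the highest-scoring candidate.
        return candidates[rewards.index(max(rewards))]
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / temperature) for r in rewards]
    # Sample one candidate with probability proportional to its weight.
    return random.choices(candidates, weights=weights, k=1)[0]
```

The smoothing interpolates between the reference policy and the reward-maximizing selection, which is the mechanism the paper credits with mitigating overoptimization of an imperfect proxy reward.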
Related papers
- Soft Best-of-n Sampling for Model Alignment [19.80655819384635]
Best-of-$n$ sampling is a practical approach for aligning language model outputs with human preferences. We introduce Soft Best-of-$n$ sampling, which allows for smooth generalization between the original distribution and the reward-maximizing distribution. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
arXiv Detail & Related papers (2025-05-06T04:03:11Z)
- Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality and (ii) compute. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z)
- Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at decoding time. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with a regularization term in the objective, and shows that it outperforms BoN sampling. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to the worst-case RBoN proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z)
- BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z) - Variational Best-of-N Alignment [57.617866305771756]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. We propose to fine-tune the language model to mimic what BoN does during inference. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN).
arXiv Detail & Related papers (2024-07-08T15:59:44Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment [7.349727826230864]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at decoding time. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise performance on the true objective. We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
arXiv Detail & Related papers (2024-04-01T11:26:50Z)
- Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that a commonly used analytical formula for the KL divergence between the best-of-$n$ policy and the reference policy is in fact an upper bound on the actual KL divergence. We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
arXiv Detail & Related papers (2024-01-03T18:39:13Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [59.02170783023547]
We study the fixed-confidence best arm identification problem in the bandit framework. We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
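For context on the entry "Theoretical guarantees on the best-of-n alignment policy" above: the analytical expression commonly used for the KL divergence between the best-of-$n$ policy and the reference policy is $\log n - (n-1)/n$, which that work shows is in general an upper bound on the actual KL divergence. A minimal sketch (the function name is an assumption):

```python
import math

def bon_kl_upper_bound(n: int) -> float:
    """Closed-form expression log(n) - (n-1)/n, commonly used for the KL
    divergence between the best-of-n policy and the reference policy;
    in general it is an upper bound on the true KL divergence."""
    return math.log(n) - (n - 1) / n
```

The bound is 0 at $n = 1$ (no selection, so the policy equals the reference) and grows like $\log n$ for large $n$, which is the scaling behaviour the main abstract's KL bounds refine for SBoN.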
This list is automatically generated from the titles and abstracts of the papers in this site.