Soft Best-of-n Sampling for Model Alignment
- URL: http://arxiv.org/abs/2505.03156v1
- Date: Tue, 06 May 2025 04:03:11 GMT
- Title: Soft Best-of-n Sampling for Model Alignment
- Authors: Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio P. Calmon
- Abstract summary: Best-of-$n$ sampling is a practical approach for aligning language model outputs with human preferences. We introduce Soft Best-of-$n$ sampling, which allows for smooth interpolation between the original distribution and reward-maximizing distribution. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
- Score: 19.80655819384635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating $n$ responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
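The abstract describes the procedure only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the soft selection step picks one of the $n$ candidates with probability proportional to $\exp(r(x_i)/\lambda)$, so that $\lambda \to 0$ recovers standard BoN and large $\lambda$ approaches sampling from the original distribution. The names `soft_best_of_n`, `sample`, `reward`, and `lam` are hypothetical and chosen for illustration.

```python
import math
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def soft_best_of_n(
    sample: Callable[[], T],
    reward: Callable[[T], float],
    n: int,
    lam: float,
    rng: Optional[random.Random] = None,
) -> T:
    """Draw n candidates, then pick one with probability proportional to
    exp(reward / lam). This selection rule is an assumption of the sketch."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if lam < 0:
        raise ValueError("lam must be non-negative")
    rng = rng or random.Random()

    candidates = [sample() for _ in range(n)]
    rewards = [reward(c) for c in candidates]

    if lam == 0:
        # Limit case: standard (hard) Best-of-n, i.e. return the argmax.
        best = max(range(n), key=rewards.__getitem__)
        return candidates[best]

    # Subtract the maximum reward before exponentiating for numerical
    # stability; the normalized selection probabilities are unchanged.
    top = max(rewards)
    weights = [math.exp((r - top) / lam) for r in rewards]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy setup (hypothetical): the "model" is a standard Gaussian and the
    # reward is the sampled value itself.
    demo_rng = random.Random(0)
    for lam in (0.0, 0.5, 100.0):
        picks = [
            soft_best_of_n(lambda: demo_rng.gauss(0.0, 1.0), float, 8, lam, demo_rng)
            for _ in range(2000)
        ]
        print(f"lambda={lam}: mean selected reward = {sum(picks) / len(picks):.3f}")
```

In this toy setup, the mean selected reward should fall as `lam` grows, which is the reward side of the reward-versus-distortion trade-off the abstract describes.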
Related papers
- Wedge Sampling: Efficient Tensor Completion with Nearly-Linear Sample Complexity [9.42598427201735]
We introduce Wedge Sampling, a new non-adaptive sampling scheme for low-rank tensor completion. We study recovery of an order-$k$ low-rank tensor of dimension $n \times \cdots \times n$ from a subset of its entries.
arXiv Detail & Related papers (2026-02-05T16:47:13Z) - Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models [28.29554194279748]
Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent.<n>This paper studies a test-time scaling method that enables sampling from regions with higher human-aligned reward values.
arXiv Detail & Related papers (2026-02-03T07:27:27Z) - CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning [62.56541355300587]
We introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths.<n>Within this framework, we propose CarBoN, a two-phase method that first explores the solution space and then learns a calibration of the logits.<n>Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4times$ fewer rollouts to reach the same accuracy.
arXiv Detail & Related papers (2025-10-17T14:04:37Z) - Learn to Guide Your Diffusion Model [84.82855046749657]
We study a technique for improving quality of samples from conditional diffusion models.<n>We learn guidance weights $omega_c,(s,t)$, which are functions of the conditioning $c$, the time $t$ from which we denoise, and the time $s$ towards which we denoise.<n>We extend our framework to reward guided sampling, enabling the model to target distributions tilted by a reward function.
arXiv Detail & Related papers (2025-10-01T12:21:48Z) - p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding [10.595336643423229]
$p$-less sampling is an information-theoretic approach to sampling.<n>It dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution.<n>It consistently produces high-quality outputs as temperature increases.
arXiv Detail & Related papers (2025-09-27T10:33:41Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models.<n>Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement.<n>We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models [13.312007032203857]
Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling.<n>We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain.<n>By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples.
arXiv Detail & Related papers (2025-06-25T17:59:10Z) - Contextual Learning for Stochastic Optimization [1.0819408603463425]
Motivated by optimization, we introduce the problem of learning from samples of contextual value distributions.<n>A contextual value distribution can be understood as a family of real-valued distributions, where each sample consists of a context $x$ and a random variable drawn from the corresponding real-valued distribution $D_x$.
arXiv Detail & Related papers (2025-05-22T16:01:49Z) - Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to emphscale up the number of on-policy samples via repeated random sampling to improve alignment performance.<n>Our experiments reveal that this strategy leads to a emphdecline in performance as the sample size increases.<n>We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
arXiv Detail & Related papers (2025-02-24T04:22:57Z) - Diffusion at Absolute Zero: Langevin Sampling Using Successive Moreau Envelopes [conference paper] [52.69179872700035]
We propose a novel method for sampling from Gibbs distributions of the form $pi(x)proptoexp(-U(x))$ with a potential $U(x)$.<n>Inspired by diffusion models we propose to consider a sequence $(pit_k)_k$ of approximations of the target density, for which $pit_kapprox pi$ for $k$ small and, on the other hand, $pit_k$ exhibits favorable properties for sampling for $k$ large.
arXiv Detail & Related papers (2025-02-03T13:50:57Z)
- BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [16.38043428743923]
This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling.
We show that best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model.
Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy.
arXiv Detail & Related papers (2024-06-02T18:42:57Z)
- Asymptotics of Language Model Alignment [27.37118975691123]
We show that the optimal KL-constrained RL solution satisfies a large deviation principle.
We also show that the rate of growth of the scaled cumulants of the reward is characterized by proper Renyi cross entropy.
arXiv Detail & Related papers (2024-04-02T08:40:07Z)
- Optimal Budgeted Rejection Sampling for Generative Models [54.050498411883495]
Rejection sampling methods have been proposed to improve the performance of discriminator-based generative models.
We first propose an Optimal Budgeted Rejection Sampling scheme that is provably optimal.
Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance.
arXiv Detail & Related papers (2023-11-01T11:52:41Z)
- Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
arXiv Detail & Related papers (2021-06-30T11:00:24Z)
- Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors [59.33977545294148]
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function.
We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
arXiv Detail & Related papers (2021-06-29T12:49:54Z)
- The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, \Sigma)$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma
arXiv Detail & Related papers (2020-12-31T18:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.