Soft Best-of-n Sampling for Model Alignment
- URL: http://arxiv.org/abs/2505.03156v1
- Date: Tue, 06 May 2025 04:03:11 GMT
- Title: Soft Best-of-n Sampling for Model Alignment
- Authors: Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio P. Calmon
- Abstract summary: Best-of-$n$ sampling is a practical approach for aligning language model outputs with human preferences. We introduce Soft Best-of-$n$ sampling, which allows for smooth interpolation between the original distribution and the reward-maximizing distribution. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
- Score: 19.80655819384635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating $n$ responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
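Based only on the description above, the selection step of Soft Best-of-$n$ can be sketched as a softmax over candidate rewards with temperature $\lambda$: small $\lambda$ keeps the base distribution over the $n$ draws, while large $\lambda$ recovers hard best-of-$n$. This is a minimal sketch from the abstract; the paper's exact construction may differ.

```python
import math
import random

def soft_best_of_n(samples, rewards, lam, rng=random):
    """Select one of n candidate responses.

    Instead of deterministically returning the argmax-reward sample
    (standard BoN), pick index i with probability proportional to
    exp(lam * rewards[i]). lam -> 0 gives a uniform pick over the n
    draws (the base distribution); lam -> infinity gives hard BoN.
    """
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(lam * (r - m)) for r in rewards]
    u = rng.random() * sum(weights)
    acc = 0.0
    for sample, w in zip(samples, weights):
        acc += w
        if u <= acc:
            return sample
    return samples[-1]  # guard against floating-point round-off
```

With a large temperature the softmax concentrates on the highest-reward candidate, matching ordinary best-of-$n$ selection.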
Related papers
- Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models [13.312007032203857]
Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain. By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples.
arXiv Detail & Related papers (2025-06-25T17:59:10Z) - Contextual Learning for Stochastic Optimization [1.0819408603463425]
Motivated by stochastic optimization, we introduce the problem of learning from samples of contextual value distributions. A contextual value distribution can be understood as a family of real-valued distributions, where each sample consists of a context $x$ and a random variable drawn from the corresponding real-valued distribution $D_x$.
arXiv Detail & Related papers (2025-05-22T16:01:49Z) - Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Our experiments reveal that this strategy leads to a decline in performance as the sample size increases. We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
arXiv Detail & Related papers (2025-02-24T04:22:57Z) - Diffusion at Absolute Zero: Langevin Sampling Using Successive Moreau Envelopes [52.69179872700035]
We propose a novel method for sampling from Gibbs distributions of the form $\pi(x) \propto \exp(-U(x))$ with a potential $U(x)$. Inspired by diffusion models, we propose to consider a sequence $(\pi_{t_k})_k$ of approximations of the target density, for which $\pi_{t_k} \approx \pi$ for $k$ small and, on the other hand, $\pi_{t_k}$ exhibits favorable properties for sampling for $k$ large.
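For orientation, the baseline such methods refine is the plain unadjusted Langevin update, which moves along the negative gradient of the potential $U$ and adds Gaussian noise. The successive-Moreau-envelope construction replaces $U$ by a sequence of smoothed potentials and is not reproduced here; this is only a generic sketch of the underlying step.

```python
import math
import random

def ula_step(x, grad_U, step, rng=random):
    # One unadjusted Langevin (ULA) update targeting pi(x) ∝ exp(-U(x)):
    #   x' = x - step * grad U(x) + sqrt(2 * step) * standard normal noise
    return x - step * grad_U(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)

def ula_chain(x0, grad_U, step, n_steps, rng=random):
    # Iterate the update; for small `step` the chain's law approaches pi,
    # up to a discretization bias that shrinks with the step size.
    xs, x = [], x0
    for _ in range(n_steps):
        x = ula_step(x, grad_U, step, rng)
        xs.append(x)
    return xs
```

For $U(x) = x^2/2$ (standard Gaussian target), the long-run sample mean is near 0 and the sample variance near 1, up to discretization bias.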
arXiv Detail & Related papers (2025-02-03T13:50:57Z) - BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [16.38043428743923]
This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling.
We show that best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model.
Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy.
arXiv Detail & Related papers (2024-06-02T18:42:57Z) - Asymptotics of Language Model Alignment [27.37118975691123]
We show that the optimal KL-constrained RL solution satisfies a large deviation principle.
We also show that the rate of growth of the scaled cumulants of the reward is characterized by the proper Rényi cross-entropy.
arXiv Detail & Related papers (2024-04-02T08:40:07Z) - Optimal Budgeted Rejection Sampling for Generative Models [54.050498411883495]
Rejection sampling methods have been proposed to improve the performance of discriminator-based generative models.
We first propose an Optimal Budgeted Rejection Sampling scheme that is provably optimal.
Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance.
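The generic budgeted rejection loop being optimized can be sketched as follows; `accept_prob` stands in for whatever discriminator-derived acceptance rule is used, and the paper's provably optimal rule is not reproduced here.

```python
import random

def budgeted_rejection_sample(generate, accept_prob, budget, rng=random):
    """Draw up to `budget` candidates from `generate`, accepting each
    with probability accept_prob(x). If the budget is exhausted without
    an acceptance, fall back to the last candidate so that a sample is
    always produced (the 'budgeted' part of the scheme)."""
    x = None
    for _ in range(budget):
        x = generate()
        if rng.random() < accept_prob(x):
            return x
    return x  # budget exhausted: return the final draw
```

The acceptance rule is what distinguishes schemes: a fixed budget forces a trade-off between sample quality and the guarantee of returning something, which is the quantity the optimal scheme is designed around.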
arXiv Detail & Related papers (2023-11-01T11:52:41Z) - Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
arXiv Detail & Related papers (2021-06-30T11:00:24Z) - Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors [59.33977545294148]
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function. We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
arXiv Detail & Related papers (2021-06-29T12:49:54Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance $\Sigma$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.