Soft Best-of-n Sampling for Model Alignment
- URL: http://arxiv.org/abs/2505.03156v1
- Date: Tue, 06 May 2025 04:03:11 GMT
- Title: Soft Best-of-n Sampling for Model Alignment
- Authors: Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio P. Calmon
- Abstract summary: Best-of-$n$ sampling is a practical approach for aligning language model outputs with human preferences. We introduce Soft Best-of-$n$ sampling, which allows for smooth interpolation between the original distribution and the reward-maximizing distribution. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
- Score: 19.80655819384635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating $n$ responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
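Based only on the description above, the selection step of Soft Best-of-$n$ can be sketched as a softmax over candidate rewards with temperature $\lambda$: small $\lambda$ keeps the base distribution over the $n$ draws, while large $\lambda$ recovers hard best-of-$n$. This is a minimal sketch from the abstract; the paper's exact construction may differ.

```python
import math
import random

def soft_best_of_n(samples, rewards, lam, rng=random):
    """Select one of n candidate responses.

    Instead of deterministically returning the argmax-reward sample
    (standard BoN), pick index i with probability proportional to
    exp(lam * rewards[i]). lam -> 0 gives a uniform pick over the n
    draws (the base distribution); lam -> infinity gives hard BoN.
    """
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(lam * (r - m)) for r in rewards]
    u = rng.random() * sum(weights)
    acc = 0.0
    for sample, w in zip(samples, weights):
        acc += w
        if u <= acc:
            return sample
    return samples[-1]  # guard against floating-point round-off
```

With a large temperature the softmax concentrates on the highest-reward candidate, matching ordinary best-of-$n$ selection.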
Related papers
- Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models [13.312007032203857]
Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain. By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples.
arXiv Detail & Related papers (2025-06-25T17:59:10Z) - Contextual Learning for Stochastic Optimization [1.0819408603463425]
Motivated by stochastic optimization, we introduce the problem of learning from samples of contextual value distributions. A contextual value distribution can be understood as a family of real-valued distributions, where each sample consists of a context $x$ and a random variable drawn from the corresponding real-valued distribution $D_x$.
arXiv Detail & Related papers (2025-05-22T16:01:49Z) - Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Our experiments reveal that this strategy leads to a decline in performance as the sample size increases. We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
arXiv Detail & Related papers (2025-02-24T04:22:57Z) - Diffusion at Absolute Zero: Langevin Sampling Using Successive Moreau Envelopes [52.69179872700035]
We propose a novel method for sampling from Gibbs distributions of the form $\pi(x) \propto \exp(-U(x))$ with a potential $U(x)$. Inspired by diffusion models, we propose to consider a sequence $(\pi_{t_k})_k$ of approximations of the target density, for which $\pi_{t_k} \approx \pi$ for $k$ small and, on the other hand, $\pi_{t_k}$ exhibits favorable properties for sampling for $k$ large.
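For orientation, the baseline such methods refine is the plain unadjusted Langevin update, which moves along the negative gradient of the potential $U$ and adds Gaussian noise. The successive-Moreau-envelope construction replaces $U$ by a sequence of smoothed potentials and is not reproduced here; this is only a generic sketch of the underlying step.

```python
import math
import random

def ula_step(x, grad_U, step, rng=random):
    # One unadjusted Langevin (ULA) update targeting pi(x) ∝ exp(-U(x)):
    #   x' = x - step * grad U(x) + sqrt(2 * step) * standard normal noise
    return x - step * grad_U(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)

def ula_chain(x0, grad_U, step, n_steps, rng=random):
    # Iterate the update; for small `step` the chain's law approaches pi,
    # up to a discretization bias that shrinks with the step size.
    xs, x = [], x0
    for _ in range(n_steps):
        x = ula_step(x, grad_U, step, rng)
        xs.append(x)
    return xs
```

For $U(x) = x^2/2$ (standard Gaussian target), the long-run sample mean is near 0 and the sample variance near 1, up to discretization bias.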
arXiv Detail & Related papers (2025-02-03T13:50:57Z) - BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [16.38043428743923]
This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling.
We show that best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model.
Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy.
arXiv Detail & Related papers (2024-06-02T18:42:57Z) - Asymptotics of Language Model Alignment [27.37118975691123]
We show that the optimal KL-constrained RL solution satisfies a large deviation principle.
We also show that the rate of growth of the scaled cumulants of the reward is characterized by the proper Rényi cross-entropy.
arXiv Detail & Related papers (2024-04-02T08:40:07Z) - Optimal Budgeted Rejection Sampling for Generative Models [54.050498411883495]
Rejection sampling methods have been proposed to improve the performance of discriminator-based generative models.
We first propose an Optimal Budgeted Rejection Sampling scheme that is provably optimal.
Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance.
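The generic budgeted rejection loop being optimized can be sketched as follows; `accept_prob` stands in for whatever discriminator-derived acceptance rule is used, and the paper's provably optimal rule is not reproduced here.

```python
import random

def budgeted_rejection_sample(generate, accept_prob, budget, rng=random):
    """Draw up to `budget` candidates from `generate`, accepting each
    with probability accept_prob(x). If the budget is exhausted without
    an acceptance, fall back to the last candidate so that a sample is
    always produced (the 'budgeted' part of the scheme)."""
    x = None
    for _ in range(budget):
        x = generate()
        if rng.random() < accept_prob(x):
            return x
    return x  # budget exhausted: return the final draw
```

The acceptance rule is what distinguishes schemes: a fixed budget forces a trade-off between sample quality and the guarantee of returning something, which is the quantity the optimal scheme is designed around.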
arXiv Detail & Related papers (2023-11-01T11:52:41Z) - Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
arXiv Detail & Related papers (2021-06-30T11:00:24Z) - Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors [59.33977545294148]
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function. We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
arXiv Detail & Related papers (2021-06-29T12:49:54Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance $\Sigma$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.