BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
- URL: http://arxiv.org/abs/2406.00832v3
- Date: Fri, 01 Nov 2024 20:02:32 GMT
- Title: BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
- Authors: Lin Gui, Cristina Gârbacea, Victor Veitch
- Abstract summary: This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling.
We show that best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model.
Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy.
- Score: 16.38043428743923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling, where we draw $n$ samples, rank them, and return the best one. We consider two fundamental problems. First: what is the relationship between best-of-$n$ and approaches to alignment that train LLMs to output samples with a high expected reward (e.g., RLHF or DPO)? To answer this, we embed both the best-of-$n$ distribution and the sampling distributions learned by alignment procedures in a common class of tiltings of the base LLM distribution. We then show that, within this class, best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model. That is, best-of-$n$ is the best choice of alignment distribution if the goal is to maximize win rate. However, best-of-$n$ requires drawing $n$ samples for each inference, a substantial cost. To avoid this, the second problem we consider is how to fine-tune a LLM to mimic the best-of-$n$ sampling distribution. We derive BoNBoN Alignment to achieve this by exploiting the special structure of the best-of-$n$ distribution. Experiments show that BoNBoN alignment yields substantial improvements in producing a model that is preferred to the base policy while minimally affecting off-target aspects.
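The best-of-$n$ scheme described in the abstract is simple enough to sketch directly. In this minimal Python sketch, `generate` and `reward` are hypothetical stand-ins for the base model's sampler and a preference/reward model; any callables with these shapes will do:

```python
import random

def best_of_n(generate, reward, prompt, n):
    """Draw n candidate completions and return the highest-reward one:
    the best-of-n selection rule from the abstract."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy illustration: the "model" samples digits uniformly and the
# "reward" prefers larger ones, so best-of-n tilts the output
# distribution toward high-reward samples without any retraining.
random.seed(0)
toy_generate = lambda prompt: random.randint(0, 9)
toy_reward = lambda x: x
sample = best_of_n(toy_generate, toy_reward, "some prompt", n=8)
```

Only the selection rule changes; the base sampler is untouched. This is also why each inference costs $n$ draws, the overhead that BoNBoN Alignment is designed to remove.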
Related papers
- Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences? [20.004349891563706]
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. We introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy.
arXiv Detail & Related papers (2025-05-29T17:59:20Z)
- Soft Best-of-n Sampling for Model Alignment [19.80655819384635]
Best-of-$n$ sampling is a practical approach for aligning language model outputs with human preferences. We introduce Soft Best-of-$n$ sampling, which allows for smooth interpolation between the original distribution and the reward-maximizing distribution. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
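One plausible reading of the soft scheme (our assumption for illustration; the paper's exact definition may differ) replaces the hard argmax of best-of-$n$ with a softmax over rewards, so a single parameter interpolates between the base distribution and hard best-of-$n$:

```python
import math
import random

def soft_best_of_n(generate, reward, prompt, n, lam):
    """Draw n candidates, then sample one with probability proportional
    to exp(lam * reward): lam -> 0 recovers a uniform pick over the
    candidates (the base distribution), lam -> infinity recovers hard
    best-of-n. `generate` and `reward` are stand-ins, as before."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [lam * reward(c) for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]
```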
arXiv Detail & Related papers (2025-05-06T04:03:11Z)
- Improving LLM General Preference Alignment via Optimistic Online Mirror Descent [57.622821649679786]
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences.
In this paper, we drop the Bradley-Terry (BT) model assumption and study LLM alignment under general preferences, formulated as a two-player game.
We show that our approach achieves an $O(T^{-1})$ bound on the duality gap, improving upon the previous $O(T^{-1/2})$ result.
arXiv Detail & Related papers (2025-02-24T05:24:52Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R^3M$, which models the potentially corrupted preference label as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z)
- Distributional Preference Alignment of LLMs via Optimal Transport [36.95053112313244]
We propose a novel method for distributional preference alignment of LLMs called Alignment via Optimal Transport (AOT).
AOT aligns LLMs on unpaired preference data by making the reward distribution of the positive samples first-order stochastically dominant over the distribution of negative samples.
We show that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval.
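First-order stochastic dominance, the criterion AOT enforces, has a concrete meaning: the CDF of positive-sample rewards lies at or below the CDF of negative-sample rewards at every threshold. A minimal empirical check of the criterion itself (not the optimal-transport training objective):

```python
def first_order_dominates(pos, neg):
    """True if the empirical distribution of `pos` first-order
    stochastically dominates that of `neg`: F_pos(t) <= F_neg(t)
    at every threshold t, i.e. `pos` puts at least as much mass
    on high values everywhere."""
    def cdf(xs, t):
        return sum(x <= t for x in xs) / len(xs)
    thresholds = sorted(set(pos) | set(neg))
    return all(cdf(pos, t) <= cdf(neg, t) for t in thresholds)
```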
arXiv Detail & Related papers (2024-06-09T18:41:05Z)
- Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- Asymptotics of Language Model Alignment [27.37118975691123]
We show that the optimal KL-constrained RL solution satisfies a large deviation principle.
We also show that the rate of growth of the scaled cumulants of the reward is characterized by proper Rényi cross entropy.
arXiv Detail & Related papers (2024-04-02T08:40:07Z)
- Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions [11.222970035173372]
We show that a kernel-based score estimator achieves an optimal mean square error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}} \left(t^{\frac{d}{2}} \vee 1\right)\right)$, and establish a $\widetilde{O}\left(n^{-1/2} t^{-\frac{d}{4}}\right)$ upper bound for the total variation error of the distribution of the sample generated by the diffusion model under a mere sub-Gaussian assumption.
arXiv Detail & Related papers (2024-02-23T20:51:31Z)
- Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Large Language Models (LLMs) are aligned using Reinforcement Learning from Human Feedback (RLHF). We show that uniform sampling of contexts could lead to a policy that suffers a constant sub-optimality gap from the optimal policy. We propose an algorithm, $\texttt{APO}$, that iteratively collects preferences for the most uncertain contexts.
arXiv Detail & Related papers (2024-02-16T08:19:34Z)
- Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log(n) - (n-1)/n$.
We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation through a few examples.
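That closed form can be sanity-checked numerically. Under the simplifying assumption of a base distribution uniform over $N$ outcomes with distinct rewards, the exact KL of the best-of-$n$ policy approaches $\log(n) - (n-1)/n$ as $N$ grows:

```python
import math

def bon_kl_uniform(N, n):
    """Exact KL(best-of-n || base) for a uniform base over N outcomes
    with distinct rewards, ranked 1..N (higher is better). Rank k is
    selected iff it is the maximum of n i.i.d. uniform draws, so
    p_k = (k^n - (k-1)^n) / N^n."""
    kl = 0.0
    for k in range(1, N + 1):
        p = (k**n - (k - 1) ** n) / N**n
        kl += p * math.log(p * N)  # log(p_k / (1/N))
    return kl

closed_form = lambda n: math.log(n) - (n - 1) / n
```

For $n = 4$ the closed form gives about $0.636$, and the discrete computation with large $N$ lands close to it; $n = 1$ recovers the base policy and zero KL.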
arXiv Detail & Related papers (2024-01-03T18:39:13Z)
- Stochastic Approximation Approaches to Group Distributionally Robust Optimization and Beyond [89.72693227960274]
This paper investigates group distributionally robust optimization (GDRO) with the goal of learning a model that performs well over $m$ different distributions.
To reduce the number of samples in each round from $m$ to 1, we cast GDRO as a two-player game, where one player conducts sampling and the other executes an online algorithm for non-oblivious multi-armed bandits.
In the second scenario, we propose to optimize the average top-$k$ risk instead of the maximum risk, thereby mitigating the impact of outlier distributions.
arXiv Detail & Related papers (2023-02-18T09:24:15Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.