Variational Best-of-N Alignment
- URL: http://arxiv.org/abs/2407.06057v1
- Date: Mon, 8 Jul 2024 15:59:44 GMT
- Title: Variational Best-of-N Alignment
- Authors: Afra Amini, Tim Vieira, Ryan Cotterell
- Abstract summary: Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences.
We propose to fine-tune the language model to mimic what BoN does during inference.
Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN)
- Score: 58.7977683502207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize the backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on a controlled generation task suggest that while variational BoN is not as effective as BoN in aligning language models, it comes close to BoN performance: vBoN appears more often on the Pareto frontier of reward and KL divergence than models trained with the KL-constrained RL objective.
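To make the procedure concrete, the sketch below implements plain BoN sampling as described in the abstract: draw N samples and return the one the reward model scores highest. The `generate_fn` and `reward_fn` callables are hypothetical placeholders for the language model sampler and the reward model; this is a minimal illustration of the general algorithm, not the authors' implementation.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate_fn: Callable[[str], str],        # draws one sample from the language model (hypothetical placeholder)
    reward_fn: Callable[[str, str], float],   # scores a (prompt, response) pair (hypothetical placeholder)
    n: int = 16,
) -> str:
    """Return the highest-reward response among n independent samples.

    Sampling cost grows linearly in n, which is the inference-time overhead
    that vBoN aims to amortize into fine-tuning: after fine-tuning toward the
    BoN-induced distribution, a single sample should suffice.
    """
    candidates: List[str] = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))
```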
Related papers
- Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding.
Previous work proposes Regularized BoN sampling (RBoN), a variant of BoN sampling that adds a regularization term to the objective, and shows that it outperforms plain BoN sampling.
This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which comes with a theoretical guarantee on the worst-case proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z) - Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [80.65242356955231]
We propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy.
We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenge posed by the non-differentiable argmax operator within BoN.
Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute.
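One way to read the imitation-learning variant mentioned above is as supervised fine-tuning on BoN winners, which avoids differentiating through the argmax because the argmax only selects the training target. The sketch below is a hedged illustration of that reading, not the paper's exact method; `generate_fn`, `reward_fn`, and `sft_step_fn` are hypothetical placeholders.

```python
from typing import Callable, Iterable

def bon_imitation_finetune(
    prompts: Iterable[str],
    generate_fn: Callable[[str], str],        # samples one response from the current policy (hypothetical placeholder)
    reward_fn: Callable[[str, str], float],   # reward model score for (prompt, response) (hypothetical placeholder)
    sft_step_fn: Callable[[str, str], None],  # one supervised fine-tuning step on (prompt, target) (hypothetical placeholder)
    n: int = 8,
) -> None:
    """Distill BoN behaviour into the policy by imitating BoN winners.

    The argmax over candidates is non-differentiable, but gradients never
    need to flow through it here: it only picks which sample becomes the
    supervised target for the next fine-tuning step.
    """
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(n)]
        winner = max(candidates, key=lambda y: reward_fn(prompt, y))
        sft_step_fn(prompt, winner)
```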
arXiv Detail & Related papers (2024-12-18T20:43:47Z) - Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z) - TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [39.019269570224004]
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning.
Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one.
We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling.
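The sketch below gives a rough picture of how a tree-search strategy can be layered on top of BoN-style selection: partial responses are extended chunk by chunk, scored with the reward model, and pruned, so the reward signal steers generation early instead of only ranking finished outputs. It is a generic illustration under stated assumptions, not TreeBoN's exact algorithm; `extend_fn`, `reward_fn`, and `is_finished_fn` are hypothetical placeholders.

```python
from typing import Callable, List

def tree_search_bon(
    prompt: str,
    extend_fn: Callable[[str, str], str],     # appends one sampled chunk to a partial response (hypothetical placeholder)
    reward_fn: Callable[[str, str], float],   # scores a (possibly partial) response (hypothetical placeholder)
    is_finished_fn: Callable[[str], bool],    # detects end-of-sequence (hypothetical placeholder)
    branch: int = 4,
    keep: int = 4,
    max_steps: int = 32,
) -> str:
    """Chunk-wise expansion with reward-guided pruning.

    Each round, every unfinished partial response is expanded into `branch`
    continuations, and only the `keep` highest-scoring partial responses
    survive to the next round.
    """
    beams: List[str] = [""]
    for _ in range(max_steps):
        if all(is_finished_fn(b) for b in beams):
            break
        expanded: List[str] = []
        for b in beams:
            if is_finished_fn(b):
                expanded.append(b)
            else:
                expanded.extend(extend_fn(prompt, b) for _ in range(branch))
        beams = sorted(expanded, key=lambda y: reward_fn(prompt, y), reverse=True)[:keep]
    # Final selection mirrors plain BoN over the surviving responses.
    return max(beams, key=lambda y: reward_fn(prompt, y))
```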
arXiv Detail & Related papers (2024-10-18T04:38:21Z) - BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z) - Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment [7.349727826230864]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding.
Because the reward model is an imperfect proxy for the true objective, over-optimizing against it can compromise performance on the true objective.
We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
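The blurb above describes adding a Minimum Bayes Risk term to the BoN selection criterion as a proximity regularizer. A hedged sketch of that idea: score each candidate by its proxy reward plus a weighted average similarity to the other candidates, so responses that achieve high reward only by being atypical (a symptom of reward hacking) are penalized. `similarity_fn` and the weight `beta` are hypothetical placeholders, and this illustrates the general idea rather than the paper's exact formulation.

```python
from typing import Callable, List

def mbr_regularized_bon(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],      # proxy reward model (hypothetical placeholder)
    similarity_fn: Callable[[str, str], float],  # utility/similarity between two responses (hypothetical placeholder)
    beta: float = 1.0,
) -> str:
    """Select a candidate by proxy reward plus an MBR-style proximity term."""
    def score(y: str) -> float:
        others = [z for z in candidates if z is not y]
        # Average similarity to the other candidates approximates the MBR utility.
        mbr = sum(similarity_fn(y, z) for z in others) / max(len(others), 1)
        return reward_fn(prompt, y) + beta * mbr

    return max(candidates, key=score)
```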
arXiv Detail & Related papers (2024-04-01T11:26:50Z) - Predictive Modeling through Hyper-Bayesian Optimization [60.586813904500595]
We propose a novel way of integrating model selection and BO for the single goal of reaching the function optima faster.
The algorithm alternates between BO in the model space and BO in the function space, where the goodness of the recommended model is assessed.
In addition to improved sample efficiency, the framework outputs information about the black-box function.
arXiv Detail & Related papers (2023-08-01T04:46:58Z) - Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then select the query that maximizes the trained NN's output.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
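As a simplified illustration of the "train an NN, then maximize it" step described above: fit a small neural network to the observed function values and pick the candidate input that maximizes its prediction. The actual algorithms use particular NN surrogates and randomization schemes to preserve the Thompson sampling guarantees; here a fresh random initialization per call and a finite candidate pool are crude stand-ins, and scikit-learn's `MLPRegressor` is merely a convenient surrogate for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def propose_next_query(
    X_obs: np.ndarray,       # observed inputs, shape (n_obs, dim)
    y_obs: np.ndarray,       # observed black-box values, shape (n_obs,)
    candidates: np.ndarray,  # finite pool of candidate inputs, shape (n_cand, dim)
    seed: int = 0,
) -> np.ndarray:
    """Fit an NN surrogate to the observations, then return the candidate
    that maximizes its prediction.

    Changing `seed` between calls injects randomness into the surrogate,
    a crude stand-in for the principled randomization that a Thompson
    sampling policy requires.
    """
    surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                             random_state=seed)
    surrogate.fit(X_obs, y_obs)
    predictions = surrogate.predict(candidates)
    return candidates[int(np.argmax(predictions))]
```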
arXiv Detail & Related papers (2022-10-13T09:01:58Z)