Variational Best-of-N Alignment
- URL: http://arxiv.org/abs/2407.06057v1
- Date: Mon, 8 Jul 2024 15:59:44 GMT
- Title: Variational Best-of-N Alignment
- Authors: Afra Amini, Tim Vieira, Ryan Cotterell
- Abstract summary: Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences.
We propose to fine-tune the language model to mimic what BoN does during inference.
Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN).
- Score: 58.7977683502207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize the backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on a controlled generation task suggest that while variational BoN is not as effective as BoN at aligning language models, it comes close: vBoN appears more often on the Pareto frontier of reward and KL divergence than models trained with a KL-constrained RL objective.
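Two pieces of the abstract can be made concrete. First, under the standard order-statistics argument (and assuming continuous rewards with no ties, a simplification made here for exposition), drawing N i.i.d. samples from the base model and keeping the highest-reward one induces the distribution below, and vBoN minimizes the backward KL divergence to it:

```latex
% Distribution induced by BoN (order statistics; continuous rewards, no ties).
% F is the CDF of the reward under the base model \pi.
\pi_{\mathrm{BoN}}(y \mid x) = N \, \pi(y \mid x) \, F\!\big(r(x, y)\big)^{N-1},
\qquad F(v) = \Pr_{y' \sim \pi(\cdot \mid x)}\!\big[ r(x, y') \le v \big]

% Backward KL minimized by vBoN (\log N is constant in \theta):
\mathrm{KL}\!\big(\pi_\theta \,\big\|\, \pi_{\mathrm{BoN}}\big)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[
      \log \frac{\pi_\theta(y \mid x)}{\pi(y \mid x)}
      - (N - 1) \log F\!\big(r(x, y)\big)
    \right] - \log N
```

Second, a minimal sketch of plain BoN inference itself; `sample_from_lm` and `reward_model` are toy placeholders, not the paper's models:

```python
import random

# Toy stand-ins for the policy and the reward model; both are placeholder
# assumptions so the sketch runs end to end.
def sample_from_lm(prompt: str) -> str:
    return prompt + " " + "".join(random.choices("abcd", k=4))

def reward_model(prompt: str, response: str) -> float:
    return float(len(set(response)))  # placeholder reward

def best_of_n(prompt: str, n: int = 8) -> str:
    """BoN inference: draw N samples, return the highest-reward one.
    Each output costs N model calls, the overhead vBoN aims to remove."""
    samples = [sample_from_lm(prompt) for _ in range(n)]
    return max(samples, key=lambda y: reward_model(prompt, y))

print(best_of_n("Write a haiku:"))
```

The factor-of-N throughput cost is visible in the loop: every returned output costs N model calls, which is exactly what a successful vBoN fine-tune amortizes into the weights.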
Related papers
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [39.019269570224004]
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning.
Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one.
We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling.
arXiv Detail & Related papers (2024-10-18T04:38:21Z)
- BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
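As a rough illustration of the distribution-matching idea (not BOND's exact objective, which the paper develops more carefully), one can distill BoN behavior by repeatedly up-weighting whichever sampled candidate wins under the reward. The toy below trains a categorical policy over five dummy responses with a placeholder reward vector:

```python
import torch

# Toy distillation of BoN behavior into a policy. A categorical "policy"
# over five dummy responses is trained to up-weight whichever sampled
# candidate wins under a placeholder reward vector. This forward-KL-style
# imitation is a simplified stand-in, not BOND's exact objective.
torch.manual_seed(0)
logits = torch.zeros(5, requires_grad=True)       # policy parameters
reward = torch.tensor([0.1, 0.5, 0.2, 0.9, 0.3])  # placeholder rewards
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample((8,))                    # draw N = 8 candidates
    best = samples[reward[samples].argmax()]       # the BoN winner
    loss = -dist.log_prob(best)                    # imitate the winner
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=-1))  # mass concentrates on high-reward items
```

After training, the policy's probability mass concentrates on the high-reward entries, mimicking what BoN selection would return without drawing N samples at inference time.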
arXiv Detail & Related papers (2024-07-19T18:38:25Z)
- Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment [7.349727826230864]
We propose Regularized Best-of-N (RBoN) to mitigate reward hacking.
RBoN incorporates a proximity term in response selection, similar to preference learning techniques.
Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model trained on a dataset generated with vanilla BoN.
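A minimal sketch of the selection rule this describes, with a log-probability proximity term standing in for the regularizers the paper considers; `generate`, `reward`, and `base_logprob` are toy placeholders:

```python
import random

# Toy stand-ins for the sampler, reward model, and base-model log-probability;
# all three are placeholder assumptions, and the log-probability proximity
# term is one illustrative choice of regularizer, not the paper's exact one.
def generate(prompt: str) -> str:
    return prompt + " " + "".join(random.choices("abcd", k=4))

def reward(prompt: str, response: str) -> float:
    return float(len(set(response)))  # placeholder reward

def base_logprob(prompt: str, response: str) -> float:
    return -0.5 * len(response)  # placeholder log-probability

def rbon_select(prompt: str, n: int = 16, beta: float = 0.1) -> str:
    """Regularized BoN: maximize reward plus a proximity term that
    discourages drifting far from the base model's distribution."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates,
               key=lambda y: reward(prompt, y) + beta * base_logprob(prompt, y))

print(rbon_select("Summarize:"))
```

The `beta` weight trades reward maximization against staying close to the base model, which is what mitigates reward hacking when the reward model is imperfect.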
arXiv Detail & Related papers (2024-04-01T11:26:50Z)
- Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO).
Compared to the classic GP-BO method, our PoPBO has lower costs and better robustness to noise, which is verified by abundant experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z)
- Predictive Modeling through Hyper-Bayesian Optimization [60.586813904500595]
We propose a novel way of integrating model selection and BO for the single goal of reaching the function optima faster.
The algorithm moves back and forth between BO in the model space and BO in the function space, where the goodness of the recommended model is captured.
In addition to improved sample efficiency, the framework outputs information about the black-box function.
arXiv Detail & Related papers (2023-08-01T04:46:58Z)
- Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then select the query that maximizes the trained NN's output.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z)
- Bayesian Neural Networks With Maximum Mean Discrepancy Regularization [13.97417198693205]
We show that our BNNs achieve higher accuracy on multiple benchmarks, including several image classification tasks.
We also provide a new formulation for estimating the uncertainty of a given prediction, and show that it is more robust against adversarial attacks.
arXiv Detail & Related papers (2020-03-02T14:54:48Z)