Variational Best-of-N Alignment
- URL: http://arxiv.org/abs/2407.06057v1
- Date: Mon, 8 Jul 2024 15:59:44 GMT
- Title: Variational Best-of-N Alignment
- Authors: Afra Amini, Tim Vieira, Ryan Cotterell
- Abstract summary: Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences.
We propose to fine-tune the language model to mimic what BoN does during inference.
Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN)
- Score: 58.7977683502207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize the backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on a controlled generation task suggest that while variational BoN is not as effective as BoN in aligning language models, it comes close to BoN performance: vBoN appears more often on the Pareto frontier of reward and KL divergence than models trained with the KL-constrained RL objective.
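To make the procedure concrete, the sketch below implements plain BoN sampling as described in the abstract: draw N samples and return the one the reward model scores highest. The `generate_fn` and `reward_fn` callables are hypothetical placeholders for the language model sampler and the reward model; this is a minimal illustration of the general algorithm, not the authors' implementation.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate_fn: Callable[[str], str],        # draws one sample from the language model (hypothetical placeholder)
    reward_fn: Callable[[str, str], float],   # scores a (prompt, response) pair (hypothetical placeholder)
    n: int = 16,
) -> str:
    """Return the highest-reward response among n independent samples.

    Sampling cost grows linearly in n, which is the inference-time overhead
    that vBoN aims to amortize into fine-tuning: after fine-tuning toward the
    BoN-induced distribution, a single sample should suffice.
    """
    candidates: List[str] = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))
```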
Related papers
- Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding.
Previous work proposes Regularized BoN sampling (RBoN), a variant of BoN sampling that adds a regularization term to the objective, and shows that it outperforms plain BoN sampling.
This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which comes with a theoretical guarantee on the worst-case proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z) - Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [80.65242356955231]
We propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy.
We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenge posed by the non-differentiable argmax operator within BoN.
Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute.
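One way to read the imitation-learning variant mentioned above is as supervised fine-tuning on BoN winners, which avoids differentiating through the argmax because the argmax only selects the training target. The sketch below is a hedged illustration of that reading, not the paper's exact method; `generate_fn`, `reward_fn`, and `sft_step_fn` are hypothetical placeholders.

```python
from typing import Callable, Iterable

def bon_imitation_finetune(
    prompts: Iterable[str],
    generate_fn: Callable[[str], str],        # samples one response from the current policy (hypothetical placeholder)
    reward_fn: Callable[[str, str], float],   # reward model score for (prompt, response) (hypothetical placeholder)
    sft_step_fn: Callable[[str, str], None],  # one supervised fine-tuning step on (prompt, target) (hypothetical placeholder)
    n: int = 8,
) -> None:
    """Distill BoN behaviour into the policy by imitating BoN winners.

    The argmax over candidates is non-differentiable, but gradients never
    need to flow through it here: it only picks which sample becomes the
    supervised target for the next fine-tuning step.
    """
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(n)]
        winner = max(candidates, key=lambda y: reward_fn(prompt, y))
        sft_step_fn(prompt, winner)
```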
arXiv Detail & Related papers (2024-12-18T20:43:47Z) - Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z) - TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [39.019269570224004]
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning.
Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one.
We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling.
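The sketch below gives a rough picture of how a tree-search strategy can be layered on top of BoN-style selection: partial responses are extended chunk by chunk, scored with the reward model, and pruned, so the reward signal steers generation early instead of only ranking finished outputs. It is a generic illustration under stated assumptions, not TreeBoN's exact algorithm; `extend_fn`, `reward_fn`, and `is_finished_fn` are hypothetical placeholders.

```python
from typing import Callable, List

def tree_search_bon(
    prompt: str,
    extend_fn: Callable[[str, str], str],     # appends one sampled chunk to a partial response (hypothetical placeholder)
    reward_fn: Callable[[str, str], float],   # scores a (possibly partial) response (hypothetical placeholder)
    is_finished_fn: Callable[[str], bool],    # detects end-of-sequence (hypothetical placeholder)
    branch: int = 4,
    keep: int = 4,
    max_steps: int = 32,
) -> str:
    """Chunk-wise expansion with reward-guided pruning.

    Each round, every unfinished partial response is expanded into `branch`
    continuations, and only the `keep` highest-scoring partial responses
    survive to the next round.
    """
    beams: List[str] = [""]
    for _ in range(max_steps):
        if all(is_finished_fn(b) for b in beams):
            break
        expanded: List[str] = []
        for b in beams:
            if is_finished_fn(b):
                expanded.append(b)
            else:
                expanded.extend(extend_fn(prompt, b) for _ in range(branch))
        beams = sorted(expanded, key=lambda y: reward_fn(prompt, y), reverse=True)[:keep]
    # Final selection mirrors plain BoN over the surviving responses.
    return max(beams, key=lambda y: reward_fn(prompt, y))
```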
arXiv Detail & Related papers (2024-10-18T04:38:21Z) - BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z) - Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment [7.349727826230864]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding.
Because the reward model is an imperfect proxy for the true objective, over-optimizing against it can compromise performance on the true objective.
We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
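The blurb above describes adding a Minimum Bayes Risk term to the BoN selection criterion as a proximity regularizer. A hedged sketch of that idea: score each candidate by its proxy reward plus a weighted average similarity to the other candidates, so responses that achieve high reward only by being atypical (a symptom of reward hacking) are penalized. `similarity_fn` and the weight `beta` are hypothetical placeholders, and this illustrates the general idea rather than the paper's exact formulation.

```python
from typing import Callable, List

def mbr_regularized_bon(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],      # proxy reward model (hypothetical placeholder)
    similarity_fn: Callable[[str, str], float],  # utility/similarity between two responses (hypothetical placeholder)
    beta: float = 1.0,
) -> str:
    """Select a candidate by proxy reward plus an MBR-style proximity term."""
    def score(y: str) -> float:
        others = [z for z in candidates if z is not y]
        # Average similarity to the other candidates approximates the MBR utility.
        mbr = sum(similarity_fn(y, z) for z in others) / max(len(others), 1)
        return reward_fn(prompt, y) + beta * mbr

    return max(candidates, key=score)
```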
arXiv Detail & Related papers (2024-04-01T11:26:50Z) - Predictive Modeling through Hyper-Bayesian Optimization [60.586813904500595]
We propose a novel way of integrating model selection and BO for the single goal of reaching the function optima faster.
The algorithm alternates between BO in the model space and BO in the function space, where the goodness of the recommended model is assessed.
In addition to improved sample efficiency, the framework outputs information about the black-box function.
arXiv Detail & Related papers (2023-08-01T04:46:58Z) - Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then select the query that maximizes the trained NN's output.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
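As a simplified illustration of the "train an NN, then maximize it" step described above: fit a small neural network to the observed function values and pick the candidate input that maximizes its prediction. The actual algorithms use particular NN surrogates and randomization schemes to preserve the Thompson sampling guarantees; here a fresh random initialization per call and a finite candidate pool are crude stand-ins, and scikit-learn's `MLPRegressor` is merely a convenient surrogate for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def propose_next_query(
    X_obs: np.ndarray,       # observed inputs, shape (n_obs, dim)
    y_obs: np.ndarray,       # observed black-box values, shape (n_obs,)
    candidates: np.ndarray,  # finite pool of candidate inputs, shape (n_cand, dim)
    seed: int = 0,
) -> np.ndarray:
    """Fit an NN surrogate to the observations, then return the candidate
    that maximizes its prediction.

    Changing `seed` between calls injects randomness into the surrogate,
    a crude stand-in for the principled randomization that a Thompson
    sampling policy requires.
    """
    surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                             random_state=seed)
    surrogate.fit(X_obs, y_obs)
    predictions = surrogate.predict(candidates)
    return candidates[int(np.argmax(predictions))]
```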
arXiv Detail & Related papers (2022-10-13T09:01:58Z)