Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment
- URL: http://arxiv.org/abs/2404.01054v4
- Date: Wed, 29 Jan 2025 08:52:40 GMT
- Title: Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment
- Authors: Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
- Abstract summary: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding.
Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective.
We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
- Score: 7.349727826230864
- License:
- Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking when the accuracy of the reward model is not high enough due to the quality or the quantity of the preference dataset. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. In this research, we propose MBR-BoN, a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term. We show empirically and analytically that the MBR objective quantifies the proximity of the response to the reference policy, serving as a proximity regularizer. We evaluate MBR-BoN on the AlpacaFarm and Anthropic's hh-rlhf datasets and show that it outperforms both BoN sampling and MBR decoding. We also evaluate MBR-BoN as a method for generating a pairwise preference learning dataset for Direct Preference Optimization (DPO). Empirical results show that models trained on a dataset generated with MBR-BoN outperform those trained with vanilla BoN. Our code is available at https://github.com/CyberAgentAILab/regularized-bon
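As a concrete illustration of the selection rule described in the abstract, the sketch below scores each of the N sampled responses by its proxy reward plus a weighted Monte Carlo estimate of the MBR objective (the mean utility of the response against the other samples) and returns the argmax. The names `reward_fn`, `utility_fn`, and `beta` are placeholders for this sketch, not taken from the authors' repository; see the linked code for the actual implementation.

```python
# Minimal sketch of MBR-BoN selection at inference time.
# Assumptions: `reward_fn` is the proxy reward model, `utility_fn` is a pairwise
# utility (e.g., a text-similarity metric), and `beta` weights the MBR term.
# These are illustrative placeholders, not the paper's code.
from typing import Callable, List


def mbr_bon_select(
    candidates: List[str],
    reward_fn: Callable[[str], float],
    utility_fn: Callable[[str, str], float],
    beta: float = 1.0,
) -> str:
    """Return the candidate maximizing reward + beta * Monte Carlo MBR objective."""
    n = len(candidates)
    best, best_score = candidates[0], float("-inf")
    for y in candidates:
        # Monte Carlo MBR term: mean utility of y against the sampled pool,
        # which measures how close y stays to typical reference-policy outputs.
        mbr = sum(utility_fn(y, y_prime) for y_prime in candidates) / n
        score = reward_fn(y) + beta * mbr
        if score > best_score:
            best, best_score = y, score
    return best
```

With `beta = 0` this reduces to vanilla BoN selection, and increasing `beta` shifts the choice toward pure MBR decoding, which is the sense in which the MBR term acts as a proximity regularizer.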
Related papers
- Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding.
Previous work proposes Regularized BoN sampling (RBoN), a variant of BoN sampling that adds a regularization term to its objective, and shows that it outperforms BoN sampling.
This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), a theoretically guaranteed approach to the worst-case RBoN proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence.
Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks.
This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator [32.05337749590184]
We develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions.
We then select contrastive divergence (CD) as the sampling strategy and propose a novel MC-PO algorithm.
MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
arXiv Detail & Related papers (2025-02-06T23:45:08Z)
- Robust Bayesian Optimization via Localized Online Conformal Prediction [37.549297668783254]
We introduce localized online conformal prediction-based Bayesian optimization (LOCBO).
LOCBO calibrates the GP model through localized online conformal prediction (CP).
We provide theoretical performance guarantees for LOCBO's iterates that hold for the unobserved objective function.
arXiv Detail & Related papers (2024-11-26T12:45:54Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
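To make the matching target concrete, the Best-of-N distribution that BOND distills has a simple closed form; the identity below is a standard order-statistics result (stated under the assumption that distinct responses never tie in reward), given here for reference rather than quoted from the BOND abstract.

```latex
% Best-of-N distribution: draw y_1, ..., y_N i.i.d. from the reference policy
% \pi_{\mathrm{ref}} and keep the draw with the highest reward r.
% Assuming no two distinct responses share the same reward, the kept sample
% is distributed as
\pi_{\mathrm{BoN}}(y)
  = F(y)^{N} - \bigl(F(y) - \pi_{\mathrm{ref}}(y)\bigr)^{N},
\qquad
F(y) = \Pr_{y' \sim \pi_{\mathrm{ref}}}\bigl[r(y') \le r(y)\bigr].
```

BOND's distribution matching then pulls the fine-tuned policy toward this target, recovering the quality gains of Best-of-N without drawing N samples at inference time.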
arXiv Detail & Related papers (2024-07-19T18:38:25Z)
- Variational Best-of-N Alignment [58.7977683502207]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences.
We propose to fine-tune the language model to mimic what BoN does during inference.
Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN).
arXiv Detail & Related papers (2024-07-08T15:59:44Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO).
Compared to the classic GP-BO method, our PoPBO has lower costs and better robustness to noise, which is verified by extensive experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z)
- Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)