Related papers: Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

URL: http://arxiv.org/abs/2404.01054v3
Date: Mon, 24 Jun 2024 02:31:06 GMT
Title: Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
Authors: Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe,
Abstract summary: We propose Regularized Best-of-N (RBoN) to mitigate reward hacking. RBoN incorporates a proximity term in response selection, similar to preference learning techniques. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model generated with vanilla BoN.
Score: 7.349727826230864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A common solution to prevent reward hacking in preference learning techniques is to optimize a reward using proximity regularization (e.g., KL regularization), which ensures that the language model remains close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques. We evaluate RBoN on the AlpacaFarm and Anthropic's hh-rlhf datasets and find that it outperforms BoN. As an application of RBoN, we use RBoN to generate a pairwise preference learning dataset. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model generated with vanilla BoN. Our code is available at https://github.com/CyberAgentAILab/regularized-bon

Related papers

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding [64.2888389315149]
Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N sampling serves as a common scaling technique, broadening the search space for finding better solutions. We propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings.
arXiv Detail & Related papers (2025-03-03T11:21:01Z)
Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with regularization to the objective, and shows that it outperforms BoN sampling. This paper proposes an extension of the RBoN framework, called RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case RBoN proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z)
Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator [32.05337749590184]
We develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. We then select contrastive divergence (CD) as sampling strategy, and propose a novel MC-PO algorithm. OnMC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
arXiv Detail & Related papers (2025-02-06T23:45:08Z)
Robust Bayesian Optimization via Localized Online Conformal Prediction [37.549297668783254]
We introduce localized online conformal prediction-based Bayesian optimization (LOCBO) LOCBO calibrates the GP model through localized online conformal prediction (CP) We provide theoretical performance guarantees for LOCBO's iterates that hold for the unobserved objective function.
arXiv Detail & Related papers (2024-11-26T12:45:54Z)
Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback. We frame the selection of an effective dataset as a simple regret minimization task. We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approxing NDCG with a differentiable surrogate loss. OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs) We present an uncertainty-enhanced textbfPreference textbfOptimization framework to make the LLM self-evolve with reliable feedback. Our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z)
Variational Best-of-N Alignment [58.7977683502207]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. We propose to fine-tune the language model to mimic what BoN does during inference. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN)
arXiv Detail & Related papers (2024-07-08T15:59:44Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models. The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO) Compared to the classic GP-BO method, our PoPBO has lower costs and better robustness to noise, which is verified by abundant experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z)
Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data. We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
West-of-N: Synthetic Preferences for Self-Improving Reward Models [20.643537269666137]
We present a novel approach to improve reward model quality by generating synthetic preference data. We find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data.
arXiv Detail & Related papers (2024-01-22T16:24:43Z)
Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle. While an approximation to the ground oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.