Reward Model Ensembles Help Mitigate Overoptimization
- URL: http://arxiv.org/abs/2310.02743v2
- Date: Sun, 10 Mar 2024 16:14:58 GMT
- Title: Reward Model Ensembles Help Mitigate Overoptimization
- Authors: Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
- Abstract summary: Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions.
As imperfect representations of the "true" reward, learned reward models are susceptible to overoptimization.
- Score: 7.715463015544845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning from human feedback (RLHF) is a standard approach for
fine-tuning large language models to follow instructions. As part of this
process, learned reward models are used to approximately model human
preferences. However, as imperfect representations of the "true" reward, these
learned reward models are susceptible to overoptimization. Gao et al. (2023)
studied this phenomenon in a synthetic human feedback setup with a
significantly larger "gold" reward model acting as the true reward (instead of
humans) and showed that overoptimization remains a persistent problem
regardless of the size of the proxy reward model and training data used. Using
a similar setup, we conduct a systematic study to evaluate the efficacy of
using ensemble-based conservative optimization objectives, specifically
worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for
mitigating reward model overoptimization when using two optimization methods:
(a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We
additionally extend the setup of Gao et al. (2023) to include 25% label noise
to better mirror real-world conditions. Both with and without label noise, we
find that conservative optimization practically eliminates overoptimization and
improves performance by up to 70% for BoN sampling. For PPO, ensemble-based
conservative optimization always reduces overoptimization and outperforms
single reward model optimization. Moreover, combining it with a small KL
penalty successfully prevents overoptimization at no performance cost. Overall,
our results demonstrate that ensemble-based conservative optimization can
effectively counter overoptimization.
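To make the two conservative objectives concrete, here is a minimal Python sketch of ensemble reward aggregation with WCO and UWO, applied to best-of-n selection. The variance-penalty coefficient `lam` and the toy stand-in reward models are illustrative assumptions based on the abstract's description, not the paper's implementation.

```python
# Minimal sketch of the ensemble-based conservative objectives described in the
# abstract: worst-case optimization (WCO) takes the most pessimistic ensemble
# member, while uncertainty-weighted optimization (UWO) penalizes disagreement.
# The penalty coefficient `lam` and the toy reward models are assumptions.
from statistics import mean, pvariance
from typing import Callable, List, Sequence


def wco(rewards: Sequence[float]) -> float:
    """Worst-case optimization: score a sample by its minimum ensemble reward."""
    return min(rewards)


def uwo(rewards: Sequence[float], lam: float = 1.0) -> float:
    """Uncertainty-weighted optimization: mean reward minus a variance penalty."""
    return mean(rewards) - lam * pvariance(rewards)


def best_of_n(candidates: List[str],
              ensemble: List[Callable[[str], float]],
              objective: Callable[[Sequence[float]], float]) -> str:
    """Best-of-n (BoN) selection against a conservative ensemble objective."""
    scored = [(objective([rm(c) for rm in ensemble]), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Toy stand-ins for an ensemble of learned reward models.
    ensemble = [lambda y: 0.10 * len(y),
                lambda y: 0.12 * len(y),
                lambda y: -0.05 * len(y)]  # one member disagrees strongly
    candidates = ["ok", "short answer", "a much longer and more detailed answer"]
    print(best_of_n(candidates, ensemble, wco))
    print(best_of_n(candidates, ensemble, lambda r: uwo(r, lam=0.5)))
```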
Related papers
- Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss.
OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
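As context for the differentiable NDCG surrogate mentioned in the OPO summary above, the sketch below shows one common way to make NDCG differentiable: approximating each item's rank with sigmoids so gradients can flow through the metric. The temperature `tau` and this particular construction are illustrative assumptions, not necessarily OPO's surrogate loss.

```python
# Sketch of a differentiable NDCG surrogate: ranks are approximated with
# sigmoids so the metric admits gradients. This is a generic construction
# based on the summary above, not necessarily OPO's exact surrogate.
import math


def soft_rank(scores, i, tau=0.1):
    """Approximate 1-based rank of item i (higher score means better rank)."""
    return 1.0 + sum(1.0 / (1.0 + math.exp(-(s_j - scores[i]) / tau))
                     for j, s_j in enumerate(scores) if j != i)


def approx_ndcg(scores, relevances, tau=0.1):
    """NDCG with soft ranks in the DCG term and exact ranks in the ideal DCG."""
    dcg = sum((2 ** rel - 1) / math.log2(1.0 + soft_rank(scores, i, tau))
              for i, rel in enumerate(relevances))
    ideal_order = sorted(relevances, reverse=True)
    idcg = sum((2 ** rel - 1) / math.log2(2.0 + i)
               for i, rel in enumerate(ideal_order))
    return dcg / idcg


if __name__ == "__main__":
    relevances = [3, 2, 0, 1]                # graded preference labels
    good_scores = [2.0, 1.0, -1.0, 0.0]      # orders items correctly
    bad_scores = [-1.0, 0.0, 2.0, 1.0]       # orders items incorrectly
    print(round(approx_ndcg(good_scores, relevances), 3))
    print(round(approx_ndcg(bad_scores, relevances), 3))
```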
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- AIPO: Improving Training Objective for Iterative Preference Optimization [34.24211649396053]
We study iterative preference optimization with synthetic data.
We propose our training objective for iterative preference optimization, namely Agreement-aware Iterative Preference Optimization (AIPO).
arXiv Detail & Related papers (2024-09-13T14:03:49Z)
- Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
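A minimal sketch of a preference loss that adaptively blends logistic and exponential terms, in the spirit of the DiscoPOP summary above. The sigmoid gate on the reward margin and the temperature `tau` are illustrative assumptions, not the exact discovered objective.

```python
# Sketch of a preference loss that adaptively blends a logistic (DPO-style)
# term with an exponential term, gated by the reward margin itself. The gate
# and temperature `tau` are illustrative assumptions based on the summary.
import math


def blended_preference_loss(margin: float, tau: float = 0.05) -> float:
    """`margin` is beta * (chosen log-ratio minus rejected log-ratio)."""
    logistic = math.log1p(math.exp(-margin))      # -log(sigmoid(margin))
    exponential = math.exp(-margin)
    gate = 1.0 / (1.0 + math.exp(-margin / tau))  # adaptive blending weight
    return gate * logistic + (1.0 - gate) * exponential


if __name__ == "__main__":
    for m in (-2.0, 0.0, 2.0):
        print(m, round(blended_preference_loss(m), 4))
```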
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
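A minimal sketch of the kind of combined objective described above: a pairwise preference-optimization term plus a supervised (SFT-style) log-likelihood term on the chosen response. The DPO-style form of the preference term and the weight `eta` are illustrative assumptions; see the paper for the exact regularized objective.

```python
# Sketch of a combined objective: a DPO-style pairwise preference loss plus an
# SFT-style negative log-likelihood on the chosen response. The specific form
# and the weight `eta` are illustrative assumptions, not the paper's objective.
import math


def combined_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1, eta: float = 1.0) -> float:
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    preference_loss = math.log1p(math.exp(-margin))  # -log(sigmoid(margin))
    sft_loss = -logp_chosen                          # NLL of the preferred response
    return preference_loss + eta * sft_loss


if __name__ == "__main__":
    print(combined_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                        ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```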
- Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z)
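A minimal sketch of a lightweight reward-uncertainty estimate computed only from last-layer reward-model embeddings, as the AdvPO summary above describes. The ridge-style closed form and the regularizer `lam` are illustrative assumptions, not the paper's exact estimator.

```python
# Sketch of a lightweight uncertainty estimate from last-layer reward-model
# embeddings: sigma(x) = sqrt(phi(x)^T (Phi^T Phi + lam*I)^{-1} phi(x)), where
# Phi stacks reference embeddings. The ridge term `lam` and this closed form
# are illustrative assumptions based on the summary above.
import numpy as np


def last_layer_uncertainty(phi_query: np.ndarray,
                           phi_reference: np.ndarray,
                           lam: float = 1.0) -> float:
    d = phi_reference.shape[1]
    precision = phi_reference.T @ phi_reference + lam * np.eye(d)
    cov = np.linalg.inv(precision)
    return float(np.sqrt(phi_query @ cov @ phi_query))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(256, 16))  # embeddings of responses seen in training
    in_dist = rng.normal(size=16)
    out_dist = 5.0 * rng.normal(size=16)    # far from the reference distribution
    print(last_layer_uncertainty(in_dist, reference))
    print(last_layer_uncertainty(out_dist, reference))
```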
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
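To illustrate the "mean-seeking" behavior mentioned in the summary above, the toy sketch below contrasts forward KL (mass-covering, mean-seeking) with reverse KL (mode-seeking) when fitting a single-mode distribution to a bimodal target. This is a generic illustration of the two divergences, not the paper's EXO derivation.

```python
# Toy illustration of mean-seeking vs mode-seeking behavior: fitting a
# single-mode distribution to a bimodal target by comparing forward KL
# (mean-seeking) and reverse KL (mode-seeking). Generic sketch only.
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gaussian_pmf(mean, std, support):
    """Discretized, normalized Gaussian over `support`."""
    weights = [math.exp(-0.5 * ((x - mean) / std) ** 2) for x in support]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    support = [float(x) for x in range(-10, 11)]
    # Bimodal target with modes near -5 and +5.
    left = gaussian_pmf(-5.0, 1.0, support)
    right = gaussian_pmf(5.0, 1.0, support)
    target = [0.5 * (a + b) for a, b in zip(left, right)]

    # Single-mode candidates centered at -5, 0, and +5.
    means = (-5.0, 0.0, 5.0)
    candidates = [gaussian_pmf(m, 2.0, support) for m in means]

    # Forward KL(target || q) favors covering both modes ("mean-seeking").
    print("forward KL:", {m: round(kl(target, q), 2) for m, q in zip(means, candidates)})
    # Reverse KL(q || target) favors locking onto one mode ("mode-seeking").
    print("reverse KL:", {m: round(kl(q, target), 2) for m, q in zip(means, candidates)})
```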
- Optimizer's Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization [16.57676001669012]
In data-driven optimization, the sample performance of the obtained decision typically incurs an optimistic bias against the true performance.
Common techniques to correct this bias, such as cross-validation, require repeatedly solving additional optimization problems and are therefore expensive.
We develop a general bias correction approach that directly approximates the first-order bias and does not require solving any additional optimization problems.
arXiv Detail & Related papers (2023-06-16T07:07:58Z)
- Optimizer Amalgamation [124.33523126363728]
We are motivated to study a new problem named Optimizer Amalgamation: how can we best combine a pool of "teacher" optimizers into a single "student" optimizer with stronger problem-specific performance?
First, we define three differentiable mechanisms to amalgamate a pool of analytical optimizers by gradient descent.
To reduce variance, we also explore methods to stabilize the amalgamation process by perturbing the target.
arXiv Detail & Related papers (2022-03-12T16:07:57Z)
- Bayesian Optimization for Selecting Efficient Machine Learning Models [53.202224677485525]
We present a unified Bayesian Optimization framework for jointly optimizing models for both prediction effectiveness and training efficiency.
Experiments on model selection for recommendation tasks indicate that models selected this way significantly improve training efficiency.
arXiv Detail & Related papers (2020-08-02T02:56:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.