Related papers: Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

URL: http://arxiv.org/abs/2403.05171v2
Date: Tue, 9 Jul 2024 13:17:36 GMT
Title: Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
Authors: Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu,
Abstract summary: Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback. In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
Score: 46.61909578101735
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.

Related papers

Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. BSPO reduces the generation of OOD responses during the reinforcement learning process. Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
arXiv Detail & Related papers (2025-03-23T16:20:59Z)
Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function. We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs) We present an uncertainty-enhanced textbfPreference textbfOptimization framework to make the LLM self-evolve with reliable feedback. Our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice. We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
Reward Model Ensembles Help Mitigate Overoptimization [7.715463015544845]
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As imperfect representations of the "true" reward, learned reward models are susceptible to overoptimization.
arXiv Detail & Related papers (2023-10-04T11:34:22Z)
Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling. We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk. Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.