Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
- URL: http://arxiv.org/abs/2412.09544v1
- Date: Thu, 12 Dec 2024 18:34:47 GMT
- Title: Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
- Authors: Paria Rashidinejad, Yuandong Tian
- Abstract summary: We investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking.
- Score: 36.69993567249251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain "stationary labels", resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking.
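As an illustration of the dynamic-label mechanism only (the actual POWER objective, with Guiasu's weighted entropy and robust reward maximization, is defined in the paper and not reproduced here), a DPO-style soft-label loss whose labels drift toward the model's own predicted preference behaves as described: once a pair's label reaches a stationary point, its gradient vanishes. The sketch below is hypothetical and uses generic names.

```python
import torch
import torch.nn.functional as F

def dynamic_label_dpo_step(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           soft_labels, beta=0.1, mix_rate=0.3):
    """Hypothetical sketch: DPO-style soft-label loss plus a dynamic label
    update. `soft_labels` starts at 1.0 (trust the annotated preference) and
    is interpolated toward the model's predicted preference probability, so
    pairs the model persistently contradicts end up with near-stationary
    labels and diminishing gradients."""
    # Implicit reward margin, as in DPO.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    pred_pref = torch.sigmoid(margin).detach()

    # Soft-label binary cross-entropy on the margin.
    loss = -(soft_labels * F.logsigmoid(margin)
             + (1.0 - soft_labels) * F.logsigmoid(-margin)).mean()

    # Dynamic label update toward the model's prediction.
    new_labels = (1.0 - mix_rate) * soft_labels + mix_rate * pred_pref
    return loss, new_labels
```

When a pair's soft label equals the model's predicted preference, its gradient with respect to the margin is zero, which is the "diminishing gradients for untrustworthy samples" behavior in miniature.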
Related papers
- Inference-Time Reward Hacking in Large Language Models [18.461698175682987]
Reward models function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance. We introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter and avoid reward hacking.
arXiv Detail & Related papers (2025-06-24T02:05:25Z) - On Symmetric Losses for Robust Policy Optimization with Noisy Preferences [55.8615920580824]
This work focuses on reward modeling, a core component in reinforcement learning from human feedback. We propose a principled framework for robust policy optimization under noisy preferences. We prove that symmetric losses enable successful policy optimization even under noisy labels.
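As a concrete (and partial) illustration: the sigmoid loss is a standard symmetric loss, satisfying loss(z) + loss(-z) = 1 for every margin z, and can be dropped in place of the logistic Bradley-Terry loss when fitting a reward model on possibly noisy preference pairs. The paper's framework is more general than this sketch.

```python
import torch

def symmetric_sigmoid_reward_loss(reward_chosen, reward_rejected):
    """Sigmoid loss on the reward margin: a symmetric loss whose values on z
    and -z always sum to 1, which is the property behind robustness to
    symmetric preference-label noise (unlike the unbounded logistic loss)."""
    margin = reward_chosen - reward_rejected
    return torch.sigmoid(-margin).mean()  # = (1 - sigmoid(margin)).mean()
```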
arXiv Detail & Related papers (2025-05-30T15:30:43Z) - Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems [6.873119751136341]
Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. We propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance.
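For intuition, pass@k can be estimated without bias from n attempts with c successes as 1 - C(n-c, k)/C(n, k); the sketch below computes that set-level quantity. How PKPO turns it into per-sample transformed rewards is the paper's contribution and is not reproduced here.

```python
from math import comb

def pass_at_k(rewards, k):
    """Unbiased pass@k estimate from n binary rewards (1 = solved), via the
    standard estimator 1 - C(n - c, k) / C(n, k)."""
    n = len(rewards)
    c = sum(1 for r in rewards if r > 0)
    assert 1 <= k <= n
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 attempts, 2 correct -> estimated pass@4 is about 0.79.
print(pass_at_k([0, 1, 0, 0, 1, 0, 0, 0], k=4))
```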
arXiv Detail & Related papers (2025-05-21T07:26:36Z) - Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models.
We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute.
We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z) - Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models with human values.
Reward shaping helps stabilize RLHF and partially mitigate reward hacking.
We present a comprehensive study of the prevalent reward shaping methods.
We propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model itself as the signal for reinforcement learning.
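One plausible reading of this summary, offered only as a hypothetical sketch, is to use the Bradley-Terry preference probability of the sampled response over a fixed reference response as the RL reward instead of the raw score; the exact PAR formulation is given in the paper.

```python
import torch

def preference_shaped_reward(r_sample, r_reference):
    """Hypothetical sketch: reward = sigmoid(r(x, y) - r(x, y_ref)), i.e. the
    preference probability of the sampled response over a reference one.
    The signal is bounded in (0, 1) and saturates for extreme raw scores,
    limiting the payoff from over-optimizing the proxy reward."""
    return torch.sigmoid(r_sample - r_reference)
```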
arXiv Detail & Related papers (2025-02-26T02:57:59Z) - Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment [19.02679077706812]
We study the problem of aligning large language models with human preference data.
We propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm.
The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.
arXiv Detail & Related papers (2024-12-19T04:31:56Z) - Systematic Reward Gap Optimization for Mitigating VLM Hallucinations [34.71750379630014]
We introduce Topic-level Preference Rewriting (TPR), a novel framework designed for the systematic optimization of reward gap configuration. TPR provides topic-level control over fine-grained semantic details, enabling advanced data curation strategies. It significantly reduces hallucinations by up to 93% on ObjectHal-Bench and exhibits superior data efficiency for robust and cost-effective VLM alignment.
arXiv Detail & Related papers (2024-11-26T09:42:07Z) - Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts whose chosen/rejected responses have high uncertainty.
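A minimal sketch of gradient attenuation, assuming the uncertainty estimate comes from something like reward-ensemble disagreement; the paper's concrete penalization schemes differ in detail.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_dpo_loss(margin, uncertainty, gamma=1.0):
    """Down-weight each pair's DPO loss by a factor that decreases with its
    preference uncertainty, so uncertain pairs contribute a smaller gradient.
    `margin` is the usual DPO implicit-reward margin for each pair."""
    weights = torch.exp(-gamma * uncertainty)
    return (weights * (-F.logsigmoid(margin))).mean()
```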
arXiv Detail & Related papers (2024-10-26T14:24:37Z) - Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance.
This work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF.
Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
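For background only (this is a generic KL-constrained DRO reweighting, not the paper's loss), DRO induces adaptive per-sample scaling by up-weighting the hardest preference pairs in a batch:

```python
import torch

def kl_dro_batch_loss(per_pair_losses, tau=1.0):
    """Generic KL-ball DRO surrogate: the worst-case distribution over the
    batch has weights proportional to exp(loss / tau), so harder preference
    pairs receive larger adaptive weight; tau controls how aggressive the
    reweighting is. Weights are detached, a common practical choice."""
    weights = torch.softmax(per_pair_losses / tau, dim=0).detach()
    return (weights * per_pair_losses).sum()
```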
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment [7.349727826230864]
We propose Regularized Best-of-N (RBoN) to mitigate reward hacking.
RBoN incorporates a proximity term in response selection, similar to preference learning techniques.
Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model trained on a dataset generated with vanilla BoN.
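A minimal sketch of the selection rule, assuming (for illustration only) that the proximity term is the reference-policy log-probability of each candidate:

```python
def regularized_best_of_n(candidates, reward_fn, ref_logprob_fn, lam=0.5):
    """Pick the candidate maximizing proxy reward plus a proximity term,
    rather than proxy reward alone, so responses the reference model finds
    implausible are not selected just because they exploit the reward model."""
    scored = [(reward_fn(y) + lam * ref_logprob_fn(y), y) for y in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```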
arXiv Detail & Related papers (2024-04-01T11:26:50Z) - Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards.
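A minimal sketch, assuming the penalty is a prompt-specific baseline computed offline from a few responses to the same prompt (the paper's construction may differ):

```python
def contrastive_reward(reward, baseline_rewards):
    """Penalize the raw reward-model score by a per-prompt baseline (here the
    mean score of offline baseline responses), so the policy is credited only
    for improving over what the initial model already achieves on that prompt."""
    baseline = sum(baseline_rewards) / len(baseline_rewards)
    return reward - baseline
```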
arXiv Detail & Related papers (2024-03-12T14:51:57Z) - Stabilizing RLHF through Advantage Model and Selective Rehearsal [57.504894664689]
Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences remains a significant challenge.
This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting.
We propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models advantage score and regulates score distributions across tasks to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsing.
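A hypothetical sketch of the "directly models advantage" idea: center reward scores per prompt so the modeled quantity is how much better a response is than a typical one for its prompt, keeping score distributions comparable across tasks. The actual Advantage Model is specified in the paper.

```python
import torch

def advantage_targets(rewards, prompt_ids):
    """Turn raw reward scores into advantage-style regression targets by
    subtracting each prompt's mean score; a model trained on these targets
    predicts relative quality rather than an absolute, task-dependent score."""
    advantages = torch.zeros_like(rewards)
    for pid in prompt_ids.unique():
        mask = prompt_ids == pid
        advantages[mask] = rewards[mask] - rewards[mask].mean()
    return advantages
```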
arXiv Detail & Related papers (2023-09-18T23:06:32Z) - Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z) - DORB: Dynamically Optimizing Multiple Rewards with Bandits [101.68525259222164]
Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
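For reference, a compact implementation of the standard Exp3 bandit the approach builds on; treating the arms as candidate rewards and the feedback as, say, a validation-metric improvement is an assumption made here for illustration.

```python
import math
import random

class Exp3:
    """Standard Exp3 bandit. Rewards fed to update() are assumed in [0, 1]."""

    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms

    def probabilities(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        return random.choices(range(self.n_arms), weights=self.probabilities())[0]

    def update(self, arm, reward):
        # Importance-weighted reward estimate for the pulled arm only.
        prob = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * reward / (prob * self.n_arms))
```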
arXiv Detail & Related papers (2020-11-15T21:57:47Z)