Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
- URL: http://arxiv.org/abs/2406.02900v2
- Date: Tue, 05 Nov 2024 01:44:14 GMT
- Title: Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
- Authors: Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum
- Abstract summary: Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline.
This work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
- Score: 50.808123629394245
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, which is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is reward over-optimization or reward hacking, where performance as measured by the learned proxy reward model increases, but true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline by circumventing the reward modeling phase. However, although DAAs do not use a separate proxy reward model, they still commonly deteriorate from over-optimization. While the so-called reward hacking phenomenon is not well-defined for DAAs, we still uncover similar trends: at higher KL budgets, DAA algorithms exhibit degradation patterns similar to their classic RLHF counterparts. In particular, we find that DAA methods deteriorate not only across a wide range of KL budgets but also often before even a single epoch of the dataset is completed. Through extensive empirical experimentation, this work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
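As a concrete reference point, below is a minimal sketch of the Direct Preference Optimization (DPO) loss, the prototypical DAA discussed above. The function and argument names are illustrative assumptions rather than code from the paper; the beta coefficient scales the implicit KL penalty against the reference policy and thus controls the effective KL budget.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a (batch,)-shaped tensor of summed per-sequence
    log-probabilities under the trained policy or the frozen reference.
    Smaller beta weakens the implicit KL penalty (a larger KL budget).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style logistic loss on the implicit reward margin.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```

In the paper's framing, degradation is examined as a function of the KL budget: with a smaller beta the policy can drift further from the reference, which is the regime where the degradation patterns described above appear.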
Related papers
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator (a generic sketch of such an estimator appears after this related-papers list).
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
- Averaging log-likelihoods in direct alignment [43.77763433288893]
We introduce a new averaging operator to be composed with the optimality operator giving the best policy for the underlying RL problem.
We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.
arXiv Detail & Related papers (2024-06-27T14:07:38Z)
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference.
The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The goal of Inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit better performance (and similar tractability) than off-the-shelf ones in multiple OpenAI Gym environments.
arXiv Detail & Related papers (2021-05-25T10:30:45Z)
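As a loose illustration of the zeroth-order policy-gradient idea mentioned in the first related paper above, the sketch below implements a generic two-point (SPSA-style) finite-difference gradient estimator. It assumes access to a scalar score oracle and is not the algorithm from that paper; all names are hypothetical.

```python
import numpy as np

def zeroth_order_gradient(score_fn, theta, mu=1e-2, num_samples=8, rng=None):
    """Two-point zeroth-order estimate of the gradient of score_fn at theta.

    score_fn(theta) returns a scalar; in an RLHF-style setting this could be
    an estimate of a value difference derived from preference comparisons,
    so no explicit reward model or analytic gradient is needed.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        u = rng.standard_normal(theta.shape)        # random probe direction
        delta = score_fn(theta + mu * u) - score_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u            # finite-difference slope along u
    return grad / num_samples

# Toy usage: ascend a quadratic "score" towards its maximum at theta = 1.
theta = np.zeros(3)
for _ in range(100):
    theta += 0.05 * zeroth_order_gradient(lambda t: -np.sum((t - 1.0) ** 2), theta)
```

The same estimator can be dropped into any gradient-ascent loop, which is the sense in which such methods can sidestep learning an explicit reward model.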
This list is automatically generated from the titles and abstracts of the papers on this site.