Reward Collapse in Aligning Large Language Models
- URL: http://arxiv.org/abs/2305.17608v1
- Date: Sun, 28 May 2023 02:12:00 GMT
- Title: Reward Collapse in Aligning Large Language Models
- Authors: Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su
- Abstract summary: We study the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results in an \textit{identical} reward distribution regardless of the prompt.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
- Score: 64.98482888193267
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The extraordinary capabilities of large language models (LLMs) such as
ChatGPT and GPT-4 are in part unleashed by aligning them with reward models
that are trained on human preferences, which are often represented as rankings
of responses to prompts. In this paper, we document the phenomenon of
\textit{reward collapse}, an empirical observation where the prevailing
ranking-based approach results in an \textit{identical} reward distribution
\textit{regardless} of the prompts during the terminal phase of training. This
outcome is undesirable as open-ended prompts like ``write a short story about
your best friend'' should yield a continuous range of rewards for their
completions, while specific prompts like ``what is the capital of New Zealand''
should generate either high or low rewards. Our theoretical investigation
reveals that reward collapse is primarily due to the insufficiency of the
ranking-based objective function to incorporate prompt-related information
during optimization. This insight allows us to derive closed-form expressions
for the reward distribution associated with a set of utility functions in an
asymptotic regime. To overcome reward collapse, we introduce a prompt-aware
optimization scheme that provably admits a prompt-dependent reward distribution
within the interpolating regime. Our experimental results suggest that our
proposed prompt-aware utility functions significantly alleviate reward collapse
during the training of reward models.
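As a concrete illustration of the ranking-based training discussed above, the sketch below implements a standard Bradley-Terry-style ranking loss over the rewards of ranked completions for a single prompt, plus a hypothetical prompt-aware variant that rescales reward gaps by a per-prompt open-endedness factor. The rescaling rule and the `openendedness` parameter are illustrative assumptions, not the paper's actual prompt-aware utility functions.

```python
# Minimal sketch (not the authors' code): a standard ranking-based reward-model
# objective and a hypothetical prompt-aware variant.
import torch
import torch.nn.functional as F

def ranking_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss for rewards ordered best-to-worst.

    rewards: shape (n,), reward-model scores of n ranked completions of one
    prompt. The objective depends only on reward differences, so it carries
    no prompt-specific information.
    """
    n = rewards.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # completion i is ranked above completion j
            loss = loss - F.logsigmoid(rewards[i] - rewards[j])
    return loss / (n * (n - 1) / 2)

def prompt_aware_loss(rewards: torch.Tensor, openendedness: float) -> torch.Tensor:
    """Hypothetical prompt-aware utility: temper reward gaps with a per-prompt
    scale so open-ended prompts tolerate a wider spread of rewards than
    specific, single-answer prompts. An assumption-laden stand-in for the
    paper's prompt-aware utility functions, not their exact form."""
    scale = 1.0 / (1.0 + openendedness)  # assumed per-prompt temperature
    return ranking_loss(scale * rewards)

if __name__ == "__main__":
    r = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
    print(ranking_loss(r).item(), prompt_aware_loss(r, openendedness=2.0).item())
```

Because the plain ranking loss is a function of reward differences alone, nothing in it distinguishes one prompt from another, which is the insufficiency the paper identifies as the source of reward collapse.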
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Transductive Reward Inference on Graph [53.003245457089406]
We develop a reward inference method based on the contextual properties of information propagation on graphs.
We leverage both the available data and limited reward annotations to construct a reward propagation graph.
We employ the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data.
arXiv Detail & Related papers (2024-02-06T03:31:28Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
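A minimal sketch of the attention-based redistribution described above, under the assumption that the per-token weights come from the reward model's attention and that the dense rewards should sum back to the original scalar; this is an illustration, not the paper's implementation.

```python
# Illustrative sketch: spread a sequence-level scalar reward over tokens in
# proportion to attention-style weights, yielding dense per-token rewards
# whose sum equals the original scalar.
import torch

def redistribute_reward(scalar_reward: float,
                        attention_weights: torch.Tensor) -> torch.Tensor:
    """attention_weights: shape (seq_len,), non-negative per-token weights
    (assumed to be extracted from the reward model)."""
    weights = attention_weights.clamp(min=0)
    weights = weights / weights.sum()  # normalise into a distribution over tokens
    return scalar_reward * weights     # dense per-token rewards

dense = redistribute_reward(1.5, torch.tensor([0.1, 0.4, 0.2, 0.3]))
print(dense, dense.sum())              # per-token rewards; sum recovers 1.5
```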
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
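A minimal sketch of the ensemble aggregation discussed above, assuming simple mean or worst-case (min) aggregation over the scores of several reward models; the aggregation rules and the toy stand-in models are illustrative, not the paper's exact setup.

```python
# Illustrative sketch: combine an ensemble of reward models into one estimate.
from typing import Callable, List

def ensemble_reward(reward_models: List[Callable[[str, str], float]],
                    prompt: str, response: str,
                    mode: str = "mean") -> float:
    scores = [rm(prompt, response) for rm in reward_models]
    if mode == "min":
        # pessimistic aggregation: guard against any single model being exploited
        return min(scores)
    return sum(scores) / len(scores)  # default: average over the ensemble

# toy callables standing in for trained reward models
models = [lambda p, r: 0.1 * len(r), lambda p, r: 1.0, lambda p, r: 0.5]
print(ensemble_reward(models, "a prompt", "a response", mode="min"))
```

If the ensemble members share similar error patterns, as the paper reports, both aggregates inherit those errors, which is why ensembling mitigates but does not eliminate reward hacking.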
- Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach [45.83200636718999]
A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed.
We propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution.
Experimental results show that our method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-28T21:51:38Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z)
- Inverse Reinforcement Learning via Matching of Optimality Profiles [2.561053769852449]
We propose an algorithm that learns a reward function from demonstrations of suboptimal or heterogeneous performance.
We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
arXiv Detail & Related papers (2020-11-18T13:23:43Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how exposing the reward function's code to the RL agent allows it to exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
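A minimal sketch of a reward machine as described in the entry above: a finite-state machine over high-level events whose transitions emit rewards. The task, event names, and transition table here are illustrative assumptions.

```python
# Illustrative sketch of a reward machine: rewards are emitted by transitions
# of a small finite-state machine driven by high-level events.
from typing import Dict, FrozenSet, Tuple

class RewardMachine:
    def __init__(self,
                 transitions: Dict[Tuple[str, str], Tuple[str, float]],
                 initial_state: str,
                 terminal_states: FrozenSet[str]):
        # transitions: (machine_state, event) -> (next_state, reward)
        self.transitions = transitions
        self.state = initial_state
        self.terminal = terminal_states

    def step(self, event: str) -> float:
        """Advance on a high-level event and return the emitted reward."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))  # unlisted events: stay put, zero reward
        self.state = next_state
        return reward

    def done(self) -> bool:
        return self.state in self.terminal

# Example task (assumed): "pick up coffee, then deliver it to the office".
rm = RewardMachine(
    transitions={
        ("u0", "coffee"): ("u1", 0.0),  # coffee picked up, no reward yet
        ("u1", "office"): ("u2", 1.0),  # delivered: reward 1, terminal state
    },
    initial_state="u0",
    terminal_states=frozenset({"u2"}),
)
print(rm.step("office"), rm.step("coffee"), rm.step("office"), rm.done())
```

Exposing this structure to the learner is what enables the automated reward shaping, task decomposition, and counterfactual off-policy updates mentioned above.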
This list is automatically generated from the titles and abstracts of the papers in this site.