Information-Theoretic Reward Decomposition for Generalizable RLHF
- URL: http://arxiv.org/abs/2504.06020v1
- Date: Tue, 08 Apr 2025 13:26:07 GMT
- Title: Information-Theoretic Reward Decomposition for Generalizable RLHF
- Authors: Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai,
- Abstract summary: We decompose the reward value into two independent components: prompt-free reward and prompt-related reward. We propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values.
- Score: 38.6093614792004
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts may result in poor generalization of the reward model. To address this issue, we decompose the reward value into two independent components: prompt-free reward and prompt-related reward. Prompt-free reward represents the evaluation that is determined only by responses, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
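The abstract describes two steps: extracting a prompt-free and a prompt-related reward component without extra models, and prioritizing preference pairs according to their prompt-free reward values. The exact information-theoretic extraction is not spelled out above, so the snippet below is only a minimal sketch under an assumed reading: the prompt-free reward is approximated by scoring the response against an empty prompt, the prompt-related reward is the remainder, and pairs whose preference is already explained by the prompt-free gap are down-weighted in a standard Bradley-Terry pairwise loss. All names (`reward_model`, `EMPTY_PROMPT`, `decompose_reward`, `prioritized_pairwise_loss`, `tau`) and the weighting scheme are hypothetical illustrations, not taken from the paper.

```python
# Illustrative sketch only -- NOT the paper's actual algorithm.
# Assumption: the prompt-free reward of a response can be approximated by
# scoring it with the prompt masked out (empty prompt); the prompt-related
# reward is the remainder of the full reward.
import torch
import torch.nn.functional as F

EMPTY_PROMPT = ""  # hypothetical stand-in for "no prompt information"


def decompose_reward(reward_model, prompt, response):
    """Split r(x, y) into a prompt-free part r(y) and a prompt-related part."""
    r_full = reward_model(prompt, response)                # r(x, y)
    r_prompt_free = reward_model(EMPTY_PROMPT, response)   # depends on y only
    r_prompt_related = r_full - r_prompt_free
    return r_prompt_free, r_prompt_related


def prioritized_pairwise_loss(reward_model, prompt, chosen, rejected, tau=1.0):
    """Bradley-Terry pairwise loss, re-weighted so that pairs whose preference
    is already explained by the prompt-free gap contribute less (one guess at
    the 'prioritization' described in the abstract)."""
    r_c = reward_model(prompt, chosen)
    r_r = reward_model(prompt, rejected)
    pf_c, _ = decompose_reward(reward_model, prompt, chosen)
    pf_r, _ = decompose_reward(reward_model, prompt, rejected)
    # Large prompt-free gap -> preference is prompt-independent -> lower weight.
    weight = torch.sigmoid(-(pf_c - pf_r) / tau).detach()
    return -weight * F.logsigmoid(r_c - r_r)


# Toy usage with a dummy reward model that just scores text length.
if __name__ == "__main__":
    def dummy_reward_model(prompt, response):
        return torch.tensor(float(len(prompt) + len(response)) / 10.0)

    loss = prioritized_pairwise_loss(
        dummy_reward_model, "What is RLHF?",
        "A detailed, helpful answer.", "idk")
    print(loss)
```

The down-weighting direction (large prompt-free gap implies lower weight, pushing the model to rely on prompt-related signal) is only one plausible reading; the paper's actual prioritization rule may differ.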
Related papers
- Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference [27.205035058481553]
We propose assigning scores to every sentence, introducing an intermediate-grained reward model. A novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score. Our method outperforms the response-level reward model by 2.7% on RewardBench.
arXiv Detail & Related papers (2025-03-01T14:11:04Z) - Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs).
We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards.
We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z) - R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z) - RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals [11.121749884408331]
We develop Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to attributes. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
arXiv Detail & Related papers (2024-10-15T07:22:16Z) - Evaluating Robustness of Reward Models for Mathematical Reasoning [14.97819343313859]
We introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH.
We demonstrate that the scores on RewardMATH strongly correlate with the results of optimized policy and effectively estimate reward overoptimization.
arXiv Detail & Related papers (2024-10-02T16:39:58Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Bayesian Reward Models for LLM Alignment [26.612181012468167]
We train a Bayesian reward model, which signals higher uncertainty further from the training data distribution.
We find that the resulting uncertainty estimates can effectively mitigate reward overoptimization in BoN sampling.
arXiv Detail & Related papers (2024-02-20T18:20:59Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z) - Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)