Related papers: Dense Reward for Free in Reinforcement Learning from Human Feedback

Dense Reward for Free in Reinforcement Learning from Human Feedback

URL: http://arxiv.org/abs/2402.00782v1
Date: Thu, 1 Feb 2024 17:10:35 GMT
Title: Dense Reward for Free in Reinforcement Learning from Human Feedback
Authors: Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
Abstract summary: We leverage the fact that the reward model contains more information than just its scalar output. We use these attention weights to redistribute the reward along the whole completion. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
Score: 64.92448888346125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating completions from the LLM in response to a query before using a separate reward model to assign a score to the full completion. As an auto-regressive process, the LLM has to take many "actions" (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. In this work we leverage the fact that the reward model contains more information than just its scalar output, in particular, it calculates an attention map over tokens as part of the transformer architecture. We use these attention weights to redistribute the reward along the whole completion, effectively densifying the signal and highlighting the most important tokens, all without incurring extra computational cost or requiring any additional modelling. We demonstrate that, theoretically, this approach is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.

Related papers

Generalist Reward Models: Found Inside Large Language Models [50.7432354447554]
We show that a powerful reward model is already latently present within any Large Language Models (LLMs) trained via standard next-token prediction.<n>We prove that this endogenous reward is not a reward function learned through offline inverse reinforcement learning.<n>We also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model.
arXiv Detail & Related papers (2025-06-29T13:45:54Z)
Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards.<n>We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z)
Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs)<n>We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model.<n>We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO)
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
SCAR: Shapley Credit Assignment for More Efficient RLHF [57.09358918390916]
Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models with human preferences.<n>It often suffers from sparse reward signals, making effective credit assignment challenging.<n>We propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values in cooperative game theory.
arXiv Detail & Related papers (2025-05-26T18:06:52Z)
Learning Explainable Dense Reward Shapes via Bayesian Optimization [45.34810347865996]
We frame reward shaping as an optimization problem focused on token-level credit assignment. We use explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines.
arXiv Detail & Related papers (2025-04-22T21:09:33Z)
Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next token prediction is the fundamental principle for training large language models (LLMs) We introduce R1-SGG, a multimodal LLM (M-LLM) trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference [27.205035058481553]
We propose assigning scores to every sentence, introducing an intermediate-grained reward model. A novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score. Our method outperforms the response-level reward model by 2.7% on RewardBench.
arXiv Detail & Related papers (2025-03-01T14:11:04Z)
T-REG: Preference Optimization with Token-Level Reward Regularization [35.07328450591201]
Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models with human values. Recent methods have attempted to address this limitation by introducing token-level rewards. We propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization.
arXiv Detail & Related papers (2024-12-03T18:56:07Z)
Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment [0.618727087412292]
The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn the reward function. We propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Learning (AVRIL)
arXiv Detail & Related papers (2024-11-14T10:37:34Z)
R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL [14.091146805312636]
Credit assignment problem is a central challenge in Reinforcement Learning (RL) Credit Assignment with Language Models (CALM) is a novel approach to automate credit assignment via reward shaping and options discovery. Preliminary results indicate that the knowledge of Large Language Models is a promising prior for credit assignment in RL.
arXiv Detail & Related papers (2024-09-19T14:08:09Z)
Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning [49.87923965553233]
Reinforcement Learning can lead to reward over-optimization in large language models. We introduce the Reward from Demonstration (RCfD) to recalibrate the reward objective. We show that RCfD achieves comparable performance to carefully tuned baselines while mitigating ROO.
arXiv Detail & Related papers (2024-04-30T09:57:21Z)
Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows by exploiting certain structures, one can ease the reward design process. We propose a hierarchical reward modeling framework -- HERON for scenarios: (I) The feedback signals naturally present hierarchy; (II) The reward is sparse, but with less important surrogate feedback to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior. This paper addresses the problem of IRL -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation. In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor. Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
On the Theory of Reinforcement Learning with Once-per-Episode Feedback [120.5537226120512]
We introduce a theory of reinforcement learning in which the learner receives feedback only once at the end of an episode. This is arguably more representative of real-world applications than the traditional requirement that the learner receive feedback at every time step.
arXiv Detail & Related papers (2021-05-29T19:48:51Z)
Reward prediction for representation learning and reward shaping [0.8883733362171032]
We propose learning a state representation in a self-supervised manner for reward prediction. We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning.
arXiv Detail & Related papers (2021-05-07T11:29:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.