Dense Reward for Free in Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2402.00782v1
- Date: Thu, 1 Feb 2024 17:10:35 GMT
- Title: Dense Reward for Free in Reinforcement Learning from Human Feedback
- Authors: Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
- Abstract summary: We leverage the fact that the reward model contains more information than just its scalar output.
We use the reward model's attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
- Score: 64.92448888346125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has been credited as the
key advance that has allowed Large Language Models (LLMs) to effectively follow
instructions and produce useful assistance. Classically, this involves
generating completions from the LLM in response to a query before using a
separate reward model to assign a score to the full completion. As an
auto-regressive process, the LLM has to take many "actions" (selecting
individual tokens) and only receives a single, sparse reward at the end of an
episode, a setup that is known to be difficult to optimise in traditional
reinforcement learning. In this work we leverage the fact that the reward model
contains more information than just its scalar output; in particular, it
calculates an attention map over tokens as part of the transformer
architecture. We use these attention weights to redistribute the reward along
the whole completion, effectively densifying the signal and highlighting the
most important tokens, all without incurring extra computational cost or
requiring any additional modelling. We demonstrate that, theoretically, this
approach is equivalent to potential-based reward shaping, ensuring that the
optimal policy remains unchanged. Empirically, we show that it stabilises
training, accelerates the rate of learning, and, in practical cases, may lead
to better local optima.
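
The redistribution idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `redistribute_reward`, the blending parameter `beta`, and the choice to take one pre-computed attention weight per completion token are all assumptions made for illustration. It only shows how a single end-of-episode score could be spread over tokens in proportion to attention while keeping the total reward of the completion unchanged.

```python
import numpy as np


def redistribute_reward(reward, attention, beta=1.0):
    """Spread a scalar completion-level reward over tokens (illustrative only).

    reward:    scalar score the reward model gives the full completion.
    attention: non-negative weights, one per completion token (assumed to be
               extracted from the reward model beforehand).
    beta:      hypothetical blend; 0 keeps the sparse final-token reward,
               1 redistributes everything according to attention.
    """
    attention = np.asarray(attention, dtype=float)
    if attention.ndim != 1 or attention.size == 0:
        raise ValueError("attention must be a non-empty 1-D sequence")
    if np.any(attention < 0):
        raise ValueError("attention weights must be non-negative")
    total = attention.sum()
    if total <= 0:
        raise ValueError("attention weights must not all be zero")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")

    # Normalise so the per-token shares sum to one.
    weights = attention / total
    # Dense part: each token receives its attention share of the reward.
    dense = beta * reward * weights
    # Sparse part: whatever is not redistributed stays on the final token.
    dense[-1] += (1.0 - beta) * reward
    return dense


if __name__ == "__main__":
    per_token = redistribute_reward(2.0, [1.0, 3.0, 6.0])
    print(per_token)        # per-token rewards, largest on the most-attended token
    print(per_token.sum())  # total matches the original scalar reward (up to rounding)
```

Because the per-token rewards always sum to the original score, the return of a full completion is the same as in the sparse setting; only the timing of the signal changes.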
Related papers
- Offline Reinforcement Learning with Imputed Rewards [8.856568375969848]
We propose a Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards.
Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions.
arXiv Detail & Related papers (2024-07-15T15:53:13Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages:
1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data;
2) preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning step to fine-tune the model.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning [49.87923965553233]
Reinforcement Learning can lead to reward over-optimization in large language models.
We introduce Reward Calibration from Demonstration (RCfD) to recalibrate the reward objective.
We show that RCfD achieves comparable performance to carefully tuned baselines while mitigating reward over-optimization (ROO).
arXiv Detail & Related papers (2024-04-30T09:57:21Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose HERON, a hierarchical reward modeling framework for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- On the Theory of Reinforcement Learning with Once-per-Episode Feedback [120.5537226120512]
We introduce a theory of reinforcement learning in which the learner receives feedback only once at the end of an episode.
This is arguably more representative of real-world applications than the traditional requirement that the learner receive feedback at every time step.
arXiv Detail & Related papers (2021-05-29T19:48:51Z)
- Reward prediction for representation learning and reward shaping [0.8883733362171032]
We propose learning a state representation in a self-supervised manner for reward prediction.
We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning.
arXiv Detail & Related papers (2021-05-07T11:29:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.