Rewards Encoding Environment Dynamics Improves Preference-based
Reinforcement Learning
- URL: http://arxiv.org/abs/2211.06527v1
- Date: Sat, 12 Nov 2022 00:34:41 GMT
- Title: Rewards Encoding Environment Dynamics Improves Preference-based
Reinforcement Learning
- Authors: Katherine Metcalf and Miguel Sarabia and Barry-John Theobald
- Abstract summary: We show that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks.
For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
- Score: 4.969254618158096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based reinforcement learning (RL) algorithms help avoid the
pitfalls of hand-crafted reward functions by distilling them from human
preference feedback, but they remain impractical due to the burdensome number
of labels required from the human, even for relatively simple tasks. In this
work, we demonstrate that encoding environment dynamics in the reward function
(REED) dramatically reduces the number of preference labels required in
state-of-the-art preference-based RL frameworks. We hypothesize that REED-based
methods better partition the state-action space and facilitate generalization
to state-action pairs not included in the preference dataset. REED iterates
between encoding environment dynamics in a state-action representation via a
self-supervised temporal consistency task, and bootstrapping the
preference-based reward function from the state-action representation. Whereas
prior approaches train only on the preference-labelled trajectory pairs, REED
exposes the state-action representation to all transitions experienced during
policy training. We explore the benefits of REED within the PrefPPO [1] and
PEBBLE [2] preference learning frameworks and demonstrate improvements across
experimental conditions to both the speed of policy learning and the final
policy performance. For example, on quadruped-walk and walker-walk with 50
preference labels, REED-based reward functions recover 83% and 66% of ground
truth reward policy performance, whereas without REED only 38% and 21% are
recovered. For some domains, REED-based reward functions result in policies
that outperform policies trained on the ground truth reward.
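The abstract describes REED as alternating between a self-supervised temporal-consistency objective on a state-action encoder and a preference loss on a reward head bootstrapped from that encoder. As a rough illustration of that split, the sketch below pairs a SimSiam-style consistency loss with a Bradley-Terry preference loss; the architecture, loss choices, and names are assumptions, not the authors' implementation.

```python
# Minimal sketch of a REED-style reward model, assuming a SimSiam-like
# temporal-consistency objective and a Bradley-Terry preference loss.
# Illustration only, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class REEDRewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # Shared state-action representation trained by both objectives.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.predictor = nn.Linear(hidden, hidden)  # predicts the next embedding
        self.reward_head = nn.Linear(hidden, 1)     # preference-based reward

    def embed(self, obs, act):
        return self.encoder(torch.cat([obs, act], dim=-1))

    def reward(self, obs, act):
        return self.reward_head(self.embed(obs, act)).squeeze(-1)

def temporal_consistency_loss(model, obs, act, next_obs, next_act):
    """Self-supervised dynamics objective: the predicted next embedding should
    match a stop-gradient target embedding of the next state-action pair."""
    pred = F.normalize(model.predictor(model.embed(obs, act)), dim=-1)
    with torch.no_grad():
        target = F.normalize(model.embed(next_obs, next_act), dim=-1)
    return -(pred * target).sum(dim=-1).mean()

def preference_loss(model, obs_a, act_a, obs_b, act_b, label):
    """Bradley-Terry loss on preference-labelled segment pairs.
    obs_*: (batch, T, obs_dim), act_*: (batch, T, act_dim),
    label: (batch,) float, 1.0 if segment A is preferred."""
    return_a = model.reward(obs_a, act_a).sum(dim=-1)  # summed reward per segment
    return_b = model.reward(obs_b, act_b).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(return_a - return_b, label)
```

Under this sketch, the consistency loss would be minimized on every transition the agent experiences, while the preference loss only sees the labelled pairs, mirroring the data split described in the abstract.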
Related papers
- PRACT: Optimizing Principled Reasoning and Acting of LLM Agent [96.10771520261596]
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data.
We propose a new optimization framework, Reflective Principle Optimization (RPO), to adapt action principles to specific task requirements.
Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
arXiv Detail & Related papers (2024-10-24T08:21:51Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
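The summary describes relabeling preference pairs by conditioning them on quality scores. A minimal sketch of such a relabeling step is given below; the field names, score scale, and prompt template are illustrative assumptions, not the paper's format.

```python
# Hedged sketch of score-conditioned relabeling: each preference pair becomes
# two training examples whose prompts are conditioned on a quality score, so
# both high- and low-quality responses are usable targets.
from typing import Dict, Iterable, List

def reward_augment(pairs: Iterable[Dict]) -> List[Dict]:
    """pairs: dicts with 'prompt', 'chosen', 'rejected',
    'chosen_score', 'rejected_score' (scores assumed to lie in [0, 1])."""
    augmented = []
    for pair in pairs:
        for response, score in ((pair["chosen"], pair["chosen_score"]),
                                (pair["rejected"], pair["rejected_score"])):
            augmented.append({
                # Condition the prompt on the desired quality score.
                "prompt": f"[desired quality: {score:.2f}] {pair['prompt']}",
                "response": response,
            })
    return augmented
```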
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery [1.1394969272703013]
Adversarial inverse reinforcement learning (AIRL) serves as a foundational approach to providing comprehensive and transferable task descriptions.
This paper reexamines AIRL in settings with an unobservable transition matrix or limited informative priors.
We show that AIRL can disentangle rewards for effective transfer with high probability, irrespective of specific conditions.
arXiv Detail & Related papers (2024-10-10T06:21:32Z)
- Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions [8.90692770076582]
Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning.
We show that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods.
To improve them, policies are constructed by marginalizing over reward-conditioned policies with a normalized weight function, a technique referred to as generalized marginalization; its advantage is that negative weights for policies conditioned on low rewards make the resulting policy more distinct from them.
arXiv Detail & Related papers (2024-06-16T03:43:55Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
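The summary states only that REBEL regresses relative rewards. Assuming the commonly cited least-squares form of that idea, the difference of policy log-probability ratios for two responses to the same prompt is regressed onto their reward difference; the snippet below sketches that loss, with the scale parameter eta and the exact parameterization being assumptions rather than details taken from the paper.

```python
# Hedged sketch of a REBEL-style relative-reward regression loss.
import torch

def rebel_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
               reward_a, reward_b, eta: float = 1.0):
    ratio_a = logp_new_a - logp_old_a       # log pi_theta(a|x) - log pi_old(a|x)
    ratio_b = logp_new_b - logp_old_b
    predicted = (ratio_a - ratio_b) / eta   # model-implied relative reward
    target = reward_a - reward_b            # observed relative reward
    return ((predicted - target) ** 2).mean()
```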
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Hindsight PRIORs for Reward Learning from Human Preferences [3.4990427823966828]
Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors.
Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference.
We introduce a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance.
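Based on the summary, state-importance scores from a world model guide the learned rewards to be proportional to importance. One hedged way to realize this is an auxiliary loss that redistributes a trajectory's total predicted return according to normalized importance weights, as sketched below; how the world model produces the importance scores is not specified here, so they are treated as an input.

```python
# Hedged sketch of an importance-proportional reward prior; an illustration
# of the idea, not the paper's mechanism.
import torch
import torch.nn.functional as F

def importance_prior_loss(pred_rewards: torch.Tensor,
                          importance: torch.Tensor) -> torch.Tensor:
    """pred_rewards: (T,) predicted reward for each step of one trajectory.
    importance:   (T,) nonnegative state-importance scores from a world model."""
    weights = importance / importance.sum().clamp_min(1e-8)  # normalize to sum to 1
    target = weights * pred_rewards.sum().detach()           # redistribute total return
    return F.mse_loss(pred_rewards, target)
```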
arXiv Detail & Related papers (2024-04-12T21:59:42Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
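For context, a standard inverse-propensity-score (IPS) off-policy value estimate and one illustrative "uncertainty-aware" variant are sketched below. The shrinkage rule used to incorporate propensity uncertainty is an assumption for illustration; the actual UIPS estimator may differ.

```python
# Vanilla IPS estimator plus an illustrative uncertainty-aware variant that
# shrinks importance weights when the logging propensity estimate is uncertain.
import numpy as np

def ips_value(rewards, target_probs, logging_probs):
    """Vanilla IPS: mean of (pi(a|x) / mu(a|x)) * r over logged samples."""
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

def uncertainty_aware_ips_value(rewards, target_probs, logging_probs,
                                propensity_variance, lam=1.0):
    """Down-weight samples whose propensity estimate has high variance
    (larger variance => smaller effective importance weight); assumed rule."""
    weights = target_probs / logging_probs
    shrunk = weights / (1.0 + lam * propensity_variance)
    return float(np.mean(shrunk * rewards))
```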
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training [13.866486498822228]
We propose discriminative reward co-training as an extension to deep reinforcement learning algorithms.
A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies.
Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments.
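The summary describes a discriminator trained alongside the policy to separate current-policy trajectories from beneficial trajectories produced by earlier policies. A minimal sketch of such a discriminator and a GAIL-style surrogate reward derived from it is shown below; the network layout, loss, and reward transform are assumptions, not the paper's exact design.

```python
# Hedged sketch of discriminative reward co-training: beneficial transitions
# are labelled 1, current-policy transitions 0, and the discriminator's
# confidence can stand in as a dense surrogate reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def logits(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def discriminator_loss(disc, current_obs, current_act, good_obs, good_act):
    pos = disc.logits(good_obs, good_act)        # beneficial transitions -> 1
    neg = disc.logits(current_obs, current_act)  # current policy -> 0
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))

def surrogate_reward(disc, obs, act):
    """GAIL-style dense reward from the discriminator's confidence."""
    with torch.no_grad():
        return torch.sigmoid(disc.logits(obs, act))
```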
arXiv Detail & Related papers (2023-01-18T10:42:00Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.