Rewards Encoding Environment Dynamics Improves Preference-based
Reinforcement Learning
- URL: http://arxiv.org/abs/2211.06527v1
- Date: Sat, 12 Nov 2022 00:34:41 GMT
- Title: Rewards Encoding Environment Dynamics Improves Preference-based
Reinforcement Learning
- Authors: Katherine Metcalf and Miguel Sarabia and Barry-John Theobald
- Abstract summary: We show that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks.
For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
- Score: 4.969254618158096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based reinforcement learning (RL) algorithms help avoid the
pitfalls of hand-crafted reward functions by distilling them from human
preference feedback, but they remain impractical due to the burdensome number
of labels required from the human, even for relatively simple tasks. In this
work, we demonstrate that encoding environment dynamics in the reward function
(REED) dramatically reduces the number of preference labels required in
state-of-the-art preference-based RL frameworks. We hypothesize that REED-based
methods better partition the state-action space and facilitate generalization
to state-action pairs not included in the preference dataset. REED iterates
between encoding environment dynamics in a state-action representation via a
self-supervised temporal consistency task, and bootstrapping the
preference-based reward function from the state-action representation. Whereas
prior approaches train only on the preference-labelled trajectory pairs, REED
exposes the state-action representation to all transitions experienced during
policy training. We explore the benefits of REED within the PrefPPO [1] and
PEBBLE [2] preference learning frameworks and demonstrate improvements across
experimental conditions to both the speed of policy learning and the final
policy performance. For example, on quadruped-walk and walker-walk with 50
preference labels, REED-based reward functions recover 83% and 66% of ground
truth reward policy performance, whereas without REED only 38% and 21% are
recovered. For some domains, REED-based reward functions result in policies
that outperform policies trained on the ground truth reward.
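The abstract describes REED as alternating between a self-supervised temporal-consistency objective on a state-action encoder and a preference loss on a reward head bootstrapped from that encoder. As a rough illustration of that split, the sketch below pairs a SimSiam-style consistency loss with a Bradley-Terry preference loss; the architecture, loss choices, and names are assumptions, not the authors' implementation.

```python
# Minimal sketch of a REED-style reward model, assuming a SimSiam-like
# temporal-consistency objective and a Bradley-Terry preference loss.
# Illustration only, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class REEDRewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # Shared state-action representation trained by both objectives.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.predictor = nn.Linear(hidden, hidden)  # predicts the next embedding
        self.reward_head = nn.Linear(hidden, 1)     # preference-based reward

    def embed(self, obs, act):
        return self.encoder(torch.cat([obs, act], dim=-1))

    def reward(self, obs, act):
        return self.reward_head(self.embed(obs, act)).squeeze(-1)

def temporal_consistency_loss(model, obs, act, next_obs, next_act):
    """Self-supervised dynamics objective: the predicted next embedding should
    match a stop-gradient target embedding of the next state-action pair."""
    pred = F.normalize(model.predictor(model.embed(obs, act)), dim=-1)
    with torch.no_grad():
        target = F.normalize(model.embed(next_obs, next_act), dim=-1)
    return -(pred * target).sum(dim=-1).mean()

def preference_loss(model, obs_a, act_a, obs_b, act_b, label):
    """Bradley-Terry loss on preference-labelled segment pairs.
    obs_*: (batch, T, obs_dim), act_*: (batch, T, act_dim),
    label: (batch,) float, 1.0 if segment A is preferred."""
    return_a = model.reward(obs_a, act_a).sum(dim=-1)  # summed reward per segment
    return_b = model.reward(obs_b, act_b).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(return_a - return_b, label)
```

Under this sketch, the consistency loss would be minimized on every transition the agent experiences, while the preference loss only sees the labelled pairs, mirroring the data split described in the abstract.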
Related papers
- PRACT: Optimizing Principled Reasoning and Acting of LLM Agent [96.10771520261596]
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data.
We propose a new optimization framework, Reflective Principle Optimization (RPO), to adapt action principles to specific task requirements.
Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
arXiv Detail & Related papers (2024-10-24T08:21:51Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
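The summary describes relabeling preference pairs by conditioning them on quality scores. A minimal sketch of such a relabeling step is given below; the field names, score scale, and prompt template are illustrative assumptions, not the paper's format.

```python
# Hedged sketch of score-conditioned relabeling: each preference pair becomes
# two training examples whose prompts are conditioned on a quality score, so
# both high- and low-quality responses are usable targets.
from typing import Dict, Iterable, List

def reward_augment(pairs: Iterable[Dict]) -> List[Dict]:
    """pairs: dicts with 'prompt', 'chosen', 'rejected',
    'chosen_score', 'rejected_score' (scores assumed to lie in [0, 1])."""
    augmented = []
    for pair in pairs:
        for response, score in ((pair["chosen"], pair["chosen_score"]),
                                (pair["rejected"], pair["rejected_score"])):
            augmented.append({
                # Condition the prompt on the desired quality score.
                "prompt": f"[desired quality: {score:.2f}] {pair['prompt']}",
                "response": response,
            })
    return augmented
```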
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery [1.1394969272703013]
Adversarial inverse reinforcement learning (AIRL) serves as a foundational approach to providing comprehensive and transferable task descriptions.
This paper reexamines AIRL in settings with an unobservable transition matrix or limited informative priors.
We show that AIRL can disentangle rewards for effective transfer with high probability, irrespective of specific conditions.
arXiv Detail & Related papers (2024-10-10T06:21:32Z)
- Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions [8.90692770076582]
Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning.
We show that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods.
To improve them, policies are constructed by marginalizing over reward-conditioned policies with a normalized weight function, a technique referred to as generalized marginalization; its advantage is that negative weights for policies conditioned on low rewards make the resulting policy more distinct from them.
arXiv Detail & Related papers (2024-06-16T03:43:55Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
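The summary states only that REBEL regresses relative rewards. Assuming the commonly cited least-squares form of that idea, the difference of policy log-probability ratios for two responses to the same prompt is regressed onto their reward difference; the snippet below sketches that loss, with the scale parameter eta and the exact parameterization being assumptions rather than details taken from the paper.

```python
# Hedged sketch of a REBEL-style relative-reward regression loss.
import torch

def rebel_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
               reward_a, reward_b, eta: float = 1.0):
    ratio_a = logp_new_a - logp_old_a       # log pi_theta(a|x) - log pi_old(a|x)
    ratio_b = logp_new_b - logp_old_b
    predicted = (ratio_a - ratio_b) / eta   # model-implied relative reward
    target = reward_a - reward_b            # observed relative reward
    return ((predicted - target) ** 2).mean()
```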
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Hindsight PRIORs for Reward Learning from Human Preferences [3.4990427823966828]
Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors.
Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference.
We introduce a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance.
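Based on the summary, state-importance scores from a world model guide the learned rewards to be proportional to importance. One hedged way to realize this is an auxiliary loss that redistributes a trajectory's total predicted return according to normalized importance weights, as sketched below; how the world model produces the importance scores is not specified here, so they are treated as an input.

```python
# Hedged sketch of an importance-proportional reward prior; an illustration
# of the idea, not the paper's mechanism.
import torch
import torch.nn.functional as F

def importance_prior_loss(pred_rewards: torch.Tensor,
                          importance: torch.Tensor) -> torch.Tensor:
    """pred_rewards: (T,) predicted reward for each step of one trajectory.
    importance:   (T,) nonnegative state-importance scores from a world model."""
    weights = importance / importance.sum().clamp_min(1e-8)  # normalize to sum to 1
    target = weights * pred_rewards.sum().detach()           # redistribute total return
    return F.mse_loss(pred_rewards, target)
```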
arXiv Detail & Related papers (2024-04-12T21:59:42Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
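For context, a standard inverse-propensity-score (IPS) off-policy value estimate and one illustrative "uncertainty-aware" variant are sketched below. The shrinkage rule used to incorporate propensity uncertainty is an assumption for illustration; the actual UIPS estimator may differ.

```python
# Vanilla IPS estimator plus an illustrative uncertainty-aware variant that
# shrinks importance weights when the logging propensity estimate is uncertain.
import numpy as np

def ips_value(rewards, target_probs, logging_probs):
    """Vanilla IPS: mean of (pi(a|x) / mu(a|x)) * r over logged samples."""
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

def uncertainty_aware_ips_value(rewards, target_probs, logging_probs,
                                propensity_variance, lam=1.0):
    """Down-weight samples whose propensity estimate has high variance
    (larger variance => smaller effective importance weight); assumed rule."""
    weights = target_probs / logging_probs
    shrunk = weights / (1.0 + lam * propensity_variance)
    return float(np.mean(shrunk * rewards))
```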
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training [13.866486498822228]
We propose discriminative reward co-training as an extension to deep reinforcement learning algorithms.
A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies.
Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments.
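The summary describes a discriminator trained alongside the policy to separate current-policy trajectories from beneficial trajectories produced by earlier policies. A minimal sketch of such a discriminator and a GAIL-style surrogate reward derived from it is shown below; the network layout, loss, and reward transform are assumptions, not the paper's exact design.

```python
# Hedged sketch of discriminative reward co-training: beneficial transitions
# are labelled 1, current-policy transitions 0, and the discriminator's
# confidence can stand in as a dense surrogate reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def logits(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def discriminator_loss(disc, current_obs, current_act, good_obs, good_act):
    pos = disc.logits(good_obs, good_act)        # beneficial transitions -> 1
    neg = disc.logits(current_obs, current_act)  # current policy -> 0
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))

def surrogate_reward(disc, obs, act):
    """GAIL-style dense reward from the discriminator's confidence."""
    with torch.no_grad():
        return torch.sigmoid(disc.logits(obs, act))
```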
arXiv Detail & Related papers (2023-01-18T10:42:00Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.