Potential-Based Reward Shaping For Intrinsic Motivation
- URL: http://arxiv.org/abs/2402.07411v1
- Date: Mon, 12 Feb 2024 05:12:09 GMT
- Title: Potential-Based Reward Shaping For Intrinsic Motivation
- Authors: Grant C. Forbes, Nitish Gupta, Leonardo Villalobos-Arias, Colin M.
Potts, Arnav Jhala, David L. Roberts
- Abstract summary: Intrinsic motivation (IM) reward-shaping methods can inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior.
We present an extension to PBRS that we prove preserves the set of optimal policies under a more general set of functions.
- We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is usable without altering the set of optimal policies.
- Score: 4.798097103214276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently there has been a proliferation of intrinsic motivation (IM)
reward-shaping methods to learn in complex and sparse-reward environments.
These methods can often inadvertently change the set of optimal policies in an
environment, leading to suboptimal behavior. Previous work on mitigating the
risks of reward shaping, particularly through potential-based reward shaping
(PBRS), has not been applicable to many IM methods, as they are often complex,
trainable functions themselves, and therefore dependent on a wider set of
variables than the traditional reward functions that PBRS was developed for. We
present an extension to PBRS that we prove preserves the set of optimal
policies under a more general set of functions than has been previously proven.
We also present Potential-Based Intrinsic Motivation (PBIM), a method for
converting IM rewards into a potential-based form that is usable without
altering the set of optimal policies. Testing in the MiniGrid DoorKey and Cliff
Walking environments, we demonstrate that PBIM successfully prevents the agent
from converging to a suboptimal policy and can speed up training.
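For orientation, the following is a minimal, dependency-free sketch of classical potential-based reward shaping, the mechanism that PBIM generalizes to intrinsic-motivation rewards. The grid-world potential `phi`, the goal coordinates, and the helper `shaped_reward` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of classical potential-based reward shaping (PBRS).
# Adding the shaping term F(s, s') = gamma * phi(s') - phi(s) to the
# environment reward leaves the set of optimal policies unchanged
# (Ng et al., 1999). PBIM aims to put intrinsic-motivation bonuses into
# this same form; the hand-crafted potential below is only illustrative.

GAMMA = 0.99

def phi(state):
    """Toy potential: negative Manhattan distance to a fixed goal cell."""
    x, y = state
    goal_x, goal_y = 7, 7
    return -(abs(goal_x - x) + abs(goal_y - y))

def shaped_reward(reward, state, next_state, done):
    """Environment reward plus the potential-based shaping term.

    On terminal transitions the next-state potential is taken to be zero,
    so the shaping telescopes to a constant over any episode.
    """
    next_potential = 0.0 if done else phi(next_state)
    return reward + GAMMA * next_potential - phi(state)

# Moving one step closer to the goal earns a small positive bonus on top
# of the (possibly sparse) environment reward.
print(shaped_reward(reward=0.0, state=(2, 3), next_state=(3, 3), done=False))
# -> 0.99 * (-8) - (-9) = 1.08
```

The key property is that the shaping terms telescope along any trajectory, so the relative ordering of policies under the original return is preserved.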
Related papers
- FDPP: Fine-tune Diffusion Policy with Human Preference [57.44575105114056]
FDPP (Fine-tune Diffusion Policy with Human Preference) learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
arXiv Detail & Related papers (2025-01-14T17:15:27Z)
- Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards [2.2169849640518153]
Intrinsic motivation (IM) reward-shaping methods can inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior.
We present an extension to PBRS that we prove preserves the set of optimal policies under a more general set of functions.
We also present Potential-Based Intrinsic Motivation (PBIM) and Generalized Reward Matching (GRM), methods for converting IM rewards into a potential-based form.
arXiv Detail & Related papers (2024-10-16T03:39:26Z)
- Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions [8.90692770076582]
Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning.
We show that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods.
We refer to this technique as generalized marginalization; its advantage is that negative weights on policies conditioned on low rewards can make the resulting policy more distinct from those low-reward policies.
arXiv Detail & Related papers (2024-06-16T03:43:55Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)
- Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between a limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for model-free reinforcement learning (RL) depend only on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
- Useful Policy Invariant Shaping from Arbitrary Advice [24.59807772487328]
A major challenge of RL research is to discover how to learn with less data.
Potential-based reward shaping (PBRS) holds promise, but it is limited by the need for a well-defined potential function.
The recently introduced dynamic potential-based advice (DPBA) method tackles this challenge by admitting arbitrary advice from a human or other agent; a rough sketch of the idea follows this entry.
arXiv Detail & Related papers (2020-11-02T20:29:09Z)
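Since DPBA and the main paper both rely on letting the potential change over time, here is a rough, hedged sketch of the dynamic-potential idea: a secondary potential table is trained online on the negated advice signal, and the agent receives a dynamic PBRS bonus built from it. The table, learning rate, and `advice` function are illustrative assumptions, and the details differ from the published algorithm.

```python
# Rough sketch of dynamic potential-based shaping with a learned potential,
# in the spirit of DPBA: a secondary state-action potential `phi` is updated
# online from the (negated) advice signal, and the shaping bonus handed to
# the agent is gamma * phi(s', a') - phi(s, a). All names and constants here
# are illustrative assumptions, not the published algorithm.

from collections import defaultdict

GAMMA = 0.99
BETA = 0.1  # learning rate for the secondary potential

phi = defaultdict(float)  # learned state-action potential

def advice(state, action):
    """Arbitrary advice, e.g. from a human: +1 for a preferred action."""
    return 1.0 if action == "right" else 0.0

def dpba_style_bonus(state, action, next_state, next_action):
    """Update the learned potential and return the shaping bonus."""
    old_phi = phi[(state, action)]
    # TD-style update of the potential on the negated advice reward.
    target = -advice(state, action) + GAMMA * phi[(next_state, next_action)]
    phi[(state, action)] += BETA * (target - old_phi)
    # Dynamic PBRS bonus built from the time-varying potential.
    return GAMMA * phi[(next_state, next_action)] - old_phi

# First call: both potentials start at zero, so the bonus is 0.0; as `phi`
# absorbs the advice over repeated visits, the bonus comes to reflect it.
print(dpba_style_bonus((0, 0), "right", (1, 0), "right"))
```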
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.