Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping
- URL: http://arxiv.org/abs/2011.02669v1
- Date: Thu, 5 Nov 2020 05:34:14 GMT
- Title: Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping
- Authors: Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen,
Jianye Hao, Feng Wu, Changjie Fan
- Abstract summary: Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL)
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
- Score: 71.214923471669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward shaping is an effective technique for incorporating domain knowledge
into reinforcement learning (RL). Existing approaches such as potential-based
reward shaping normally make full use of a given shaping reward function.
However, since the transformation of human knowledge into numeric reward values
is often imperfect due to reasons such as human cognitive bias, completely
utilizing the shaping reward function may fail to improve the performance of RL
algorithms. In this paper, we consider the problem of adaptively utilizing a
given shaping reward function. We formulate the utilization of shaping rewards
as a bi-level optimization problem, where the lower level is to optimize policy
using the shaping rewards and the upper level is to optimize a parameterized
shaping weight function for true reward maximization. We formally derive the
gradient of the expected true reward with respect to the shaping weight
function parameters and accordingly propose three learning algorithms based on
different assumptions. Experiments in sparse-reward cartpole and MuJoCo
environments show that our algorithms can fully exploit beneficial shaping
rewards, and meanwhile ignore unbeneficial shaping rewards or even transform
them into beneficial ones.
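To make the bi-level formulation concrete, the following is a minimal, hypothetical sketch (not the authors' algorithm or code). In a toy sparse-reward chain, the lower level runs REINFORCE on the shaped return r + z * f(s), while the upper level nudges the scalar shaping weight z toward higher true return with a crude finite-difference probe, standing in for the gradient derived in the paper. All environment details and names are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP: moving right eventually reaches the goal; the true reward
# is sparse (1 only at the goal), while the given shaping reward is a dense
# "progress" signal that may or may not be helpful.
N_STATES, HORIZON = 10, 30

def true_reward(s):
    return 1.0 if s == N_STATES - 1 else 0.0

def shaping_reward(s):
    return 0.1 * s

def rollout(theta, z):
    """One episode with a per-state softmax policy; returns the visited
    (state, action) pairs plus the true and shaped returns."""
    s, visited, G_true, G_shaped = 0, [], 0.0, 0.0
    for _ in range(HORIZON):
        logits = theta[s]
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(2, p=p)                 # 0 = stay, 1 = move right
        visited.append((s, a))
        s = min(s + a, N_STATES - 1)
        G_true += true_reward(s)
        G_shaped += true_reward(s) + z * shaping_reward(s)
    return visited, G_true, G_shaped

theta = np.zeros((N_STATES, 2))    # lower-level policy parameters
z = 1.0                            # upper-level shaping weight (scalar stand-in for z_phi)

for _ in range(200):
    visited, G_true, G_shaped = rollout(theta, z)
    # Lower level: REINFORCE update on the *shaped* return.
    for s, a in visited:
        p = np.exp(theta[s] - theta[s].max()); p /= p.sum()
        grad = -p
        grad[a] += 1.0
        theta[s] += 0.01 * G_shaped * grad
    # Upper level: adjust z toward higher *true* return; a finite-difference
    # probe stands in for the analytic gradient derived in the paper.
    _, G_plus, _ = rollout(theta, z + 0.1)
    _, G_minus, _ = rollout(theta, z - 0.1)
    z += 0.05 * (G_plus - G_minus)

print("learned shaping weight:", round(z, 3), "true return:", G_true)
```
If the shaping signal were misleading instead of helpful, the same upper-level update would push z toward zero or negative values, which is the adaptive behavior the abstract describes.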
Related papers
- ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization [41.074747242532695]
Online Reward Selection and Policy Optimization (ORSO) is a novel approach that frames shaping reward selection as an online model selection problem.
ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention.
We demonstrate ORSO's effectiveness across various continuous control tasks using the Isaac Gym simulator.
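As a rough illustration of framing shaping-reward selection as an online model selection problem, here is a hedged sketch using a UCB-style bandit over candidate shaping functions; the rule, the names, and the `train_and_evaluate` stand-in are assumptions for illustration, not ORSO's actual procedure.
```python
import math

def select_shaping_reward(candidates, train_and_evaluate, rounds=100, c=1.0):
    """Pick among candidate shaping rewards with a UCB rule; each pull means
    'train briefly with this shaping reward, then measure true task return'.
    `train_and_evaluate` is a hypothetical stand-in, not part of ORSO."""
    counts = [0] * len(candidates)
    values = [0.0] * len(candidates)
    for t in range(1, rounds + 1):
        scores = [
            values[i] + c * math.sqrt(math.log(t) / counts[i]) if counts[i] else float("inf")
            for i in range(len(candidates))
        ]
        i = scores.index(max(scores))
        ret = train_and_evaluate(candidates[i])        # observed true return
        counts[i] += 1
        values[i] += (ret - values[i]) / counts[i]     # running mean return
    return max(range(len(candidates)), key=lambda i: values[i])
```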
arXiv Detail & Related papers (2024-10-17T17:55:05Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- A Novel Variational Lower Bound for Inverse Reinforcement Learning [5.370126167091961]
Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories.
We present a new Variational Lower Bound for IRL (VLB-IRL).
Our method simultaneously learns the reward function and policy under the learned reward function.
arXiv Detail & Related papers (2023-11-07T03:50:43Z)
- Benchmarking Potential Based Rewards for Learning Humanoid Locomotion [10.406358397515838]
A well-designed shaping reward can lead to significantly faster learning.
In theory, the broad class of potential-based reward shaping (PBRS) methods can help guide the learning process without affecting the optimal policy.
In this paper, we benchmark standard forms of shaping with PBRS for a humanoid robot.
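For reference, potential-based shaping adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward, which preserves the optimal policy. A minimal sketch (the `potential` function and the terminal-state convention are user choices, not taken from the paper):
```python
def pbrs_reward(r_env, potential, s, s_next, gamma=0.99, done=False):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s).
    `potential` is any user-supplied function of the state; treating the
    potential of terminal states as 0 is a common convention."""
    phi_next = 0.0 if done else potential(s_next)
    return r_env + gamma * phi_next - potential(s)
```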
arXiv Detail & Related papers (2023-07-19T17:12:28Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent by observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
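As a rough sketch of an uncertainty-driven exploration bonus from an ensemble of learned reward models (illustrative only; not necessarily the paper's exact formulation):
```python
import numpy as np

def uncertainty_shaped_reward(reward_ensemble, s, a, beta=0.05):
    """Mean predicted reward plus an exploration bonus proportional to the
    disagreement (std) of an ensemble of learned reward models. Illustrative
    sketch; `reward_ensemble` is a list of callables r_hat(s, a)."""
    preds = np.array([r_hat(s, a) for r_hat in reward_ensemble])
    return preds.mean() + beta * preds.std()
```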
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
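As a toy illustration of a reward machine, i.e. a finite-state machine over high-level events that emits rewards (a hypothetical sketch, not the paper's implementation):
```python
# Toy reward machine for "fetch coffee, then deliver it to the office".
# Keys are (machine state, observed event); values are (next state, reward).
REWARD_MACHINE = {
    ("u0", "got_coffee"): ("u1", 0.0),
    ("u1", "at_office"): ("u_done", 1.0),
}

def rm_step(u, event):
    """Advance the machine on an observed event; unknown events self-loop."""
    return REWARD_MACHINE.get((u, event), (u, 0.0))

u, total = "u0", 0.0
for event in ["none", "got_coffee", "none", "at_office"]:
    u, r = rm_step(u, event)
    total += r
print(u, total)   # -> u_done 1.0
```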
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
- Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks [57.17673320237597]
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation.
This paper presents the first reward shaping framework for average-reward learning.
It proves that, under standard assumptions, the optimal policy under the original reward function can be recovered.
arXiv Detail & Related papers (2020-07-03T05:06:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.