Benchmarking Potential Based Rewards for Learning Humanoid Locomotion
- URL: http://arxiv.org/abs/2307.10142v1
- Date: Wed, 19 Jul 2023 17:12:28 GMT
- Title: Benchmarking Potential Based Rewards for Learning Humanoid Locomotion
- Authors: Se Hwan Jeon, Steve Heim, Charles Khazoom, Sangbae Kim
- Abstract summary: A well-designed shaping reward can lead to significantly faster learning.
In theory, the broad class of potential based reward shaping (PBRS) can help guide the learning process without affecting the optimal policy.
In this paper, we benchmark standard forms of shaping with PBRS for a humanoid robot.
- Score: 10.406358397515838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The main challenge in developing effective reinforcement learning (RL)
pipelines is often the design and tuning of the reward functions. A well-designed
shaping reward can lead to significantly faster learning. Naively formulated
rewards, however, can conflict with the desired behavior and result in
overfitting or even erratic performance if not properly tuned. In theory, the
broad class of potential based reward shaping (PBRS) can help guide the
learning process without affecting the optimal policy. Although several studies
have explored the use of potential based reward shaping to accelerate learning
convergence, most have been limited to grid-worlds and low-dimensional systems,
and RL in robotics has predominantly relied on standard forms of reward
shaping. In this paper, we benchmark standard forms of shaping with PBRS for a
humanoid robot. We find that in this high-dimensional system, PBRS has only
marginal benefits in convergence speed. However, the PBRS reward terms are
significantly more robust to scaling than typical reward shaping approaches,
and thus easier to tune.
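For context on the mechanism the abstract refers to: a potential-based shaping term has the form F(s, s') = gamma * Phi(s') - Phi(s) for some potential function Phi over states, and adding it to the environment reward leaves the optimal policy unchanged (Ng, Harada & Russell, 1999). The sketch below is a minimal illustration rather than the paper's implementation; the height-tracking potential, target height, and naive weight are assumptions made for the example, not the reward terms benchmarked in the paper.

```python
# Minimal sketch of potential-based reward shaping (PBRS) versus a naive shaping
# term, assuming a toy height-tracking potential for a humanoid base. The term
# F(s, s') = gamma * Phi(s') - Phi(s) telescopes along trajectories, so it does
# not change the optimal policy; a term added directly to the reward does, and
# its weight therefore needs careful tuning.

GAMMA = 0.99            # discount factor of the underlying RL problem
TARGET_HEIGHT = 0.9     # desired base height in metres (illustrative value)


def potential(state: dict) -> float:
    """Potential Phi(s): larger (less negative) when base height is near the target."""
    return -abs(state["base_height"] - TARGET_HEIGHT)


def pbrs_reward(state: dict, next_state: dict, env_reward: float) -> float:
    """Environment reward plus the potential-based shaping term F(s, s')."""
    shaping = GAMMA * potential(next_state) - potential(state)
    return env_reward + shaping


def naive_shaped_reward(next_state: dict, env_reward: float, weight: float) -> float:
    """'Standard' shaping: a weighted penalty added directly to the reward.
    The optimal policy now depends on the chosen weight."""
    return env_reward - weight * abs(next_state["base_height"] - TARGET_HEIGHT)


if __name__ == "__main__":
    s, s_next = {"base_height": 0.70}, {"base_height": 0.80}
    print("PBRS-shaped reward:  ", pbrs_reward(s, s_next, env_reward=1.0))
    print("Naively shaped reward:", naive_shaped_reward(s_next, env_reward=1.0, weight=5.0))
```

Because the optimal policy is invariant to the choice of potential, rescaling Phi changes only the learning dynamics and not the target behavior, which is one intuition for the scaling robustness reported in the abstract.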
Related papers
- ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization [41.074747242532695]
Online Reward Selection and Policy Optimization (ORSO) is a novel approach that frames shaping reward selection as an online model selection problem.
ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention.
We demonstrate ORSO's effectiveness across various continuous control tasks using the Isaac Gym simulator.
arXiv Detail & Related papers (2024-10-17T17:55:05Z)
- Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning [7.07264650720021]
Sub-optimal Data Pre-training (SDP) is an approach that leverages reward-free, sub-optimal data to improve human-in-the-loop (HitL) RL algorithms.
We show that SDP can significantly improve upon, or achieve performance competitive with, state-of-the-art HitL RL algorithms.
arXiv Detail & Related papers (2024-04-30T18:58:33Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep (see the sketch after this list).
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose HERON, a hierarchical reward modeling framework for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control [9.118706387430883]
Reinforcement learning (RL) has recently seen great success in various domains.
Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour.
We propose to use model predictive control (MPC) as an experience source for training RL agents in sparse reward environments.
arXiv Detail & Related papers (2022-10-04T11:06:38Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
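Referring back to the DreamSmooth entry above, the idea of predicting a temporally-smoothed reward instead of the exact per-step reward can be illustrated with a generic smoothing pass over a recorded reward sequence. This is a minimal sketch under assumed settings: the Gaussian kernel, its width, and the sparse-bonus example are illustrative choices, not the paper's configuration.

```python
# Generic temporal reward smoothing in the spirit of the DreamSmooth entry above:
# each per-step reward is replaced by a weighted average of its temporal
# neighbours, giving an easier-to-predict target for a learned reward model.
# Kernel shape and width are illustrative assumptions.

import numpy as np


def smooth_rewards(rewards: np.ndarray, sigma: float = 3.0, radius: int = 9) -> np.ndarray:
    """Gaussian-smooth a 1-D reward sequence along the time axis."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # mode="same" keeps the sequence length; the ends are implicitly zero-padded.
    return np.convolve(rewards, kernel, mode="same")


if __name__ == "__main__":
    # Sparse reward: a single success bonus at timestep 40 of a 60-step episode.
    rewards = np.zeros(60)
    rewards[40] = 1.0
    smoothed = smooth_rewards(rewards)
    print(smoothed[35:46].round(3))  # bonus mass is now spread over nearby steps
```

Smoothing spreads a sparse terminal bonus over nearby timesteps, which is what makes the target easier for a learned reward predictor, consistent with the entry's claim about long-horizon sparse-reward tasks.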