Dealing with Sparse Rewards in Continuous Control Robotics via
Heavy-Tailed Policies
- URL: http://arxiv.org/abs/2206.05652v1
- Date: Sun, 12 Jun 2022 04:09:39 GMT
- Title: Dealing with Sparse Rewards in Continuous Control Robotics via
Heavy-Tailed Policies
- Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Pratap Tokekar,
and Dinesh Manocha
- Abstract summary: We present a novel Heavy-Tailed Stochastic Policy Gradient (HT-SPG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
- Score: 64.2210390071609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient
(HT-SPG) algorithm to deal with the challenges of sparse rewards in continuous
control problems. Sparse reward is common in continuous control robotics tasks
such as manipulation and navigation, and makes the learning problem hard due to
non-trivial estimation of value functions over the state space. This demands
either reward shaping or expert demonstrations for the sparse reward
environment. However, obtaining high-quality demonstrations is quite expensive
and sometimes even impossible. We propose a heavy-tailed policy parametrization
together with a modified momentum-based policy gradient tracking scheme (HT-SPG)
to induce stable exploratory behavior in the algorithm. The proposed
algorithm does not require access to expert demonstrations. We test the
performance of HT-SPG on various benchmark tasks of continuous control with
sparse rewards such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in
OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2). We show consistent
performance improvement across all tasks in terms of high average cumulative
reward. HT-SPG also converges in fewer samples, underscoring the sample
efficiency of the proposed algorithm.
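For concreteness, below is a minimal sketch of the two ingredients the abstract names: a heavy-tailed policy parametrization and a momentum-based policy gradient tracking update. The Cauchy head, network sizes, and STORM-style recursion are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a heavy-tailed policy and a momentum-tracked gradient step.
# Everything here (Cauchy head, layer sizes, tracking recursion) is an
# editorial assumption for illustration, not the HT-SPG reference code.
import torch
import torch.nn as nn
from torch.distributions import Cauchy


class HeavyTailedPolicy(nn.Module):
    """Actor whose action distribution has heavy (Cauchy) tails for exploration."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.loc_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, act_dim))
        self.log_scale = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        # Heavy tails place non-negligible mass on large actions, which helps
        # the agent stumble onto sparse rewards.
        return Cauchy(self.loc_net(obs), self.log_scale.exp())


def momentum_tracking_step(params, grad_new, grad_old, d_prev, beta=0.9, lr=1e-3):
    """One momentum-based gradient-tracking (STORM-style) ascent step:
       d_t = g_t + (1 - beta) * (d_{t-1} - g_{t-1})."""
    d_t = [g_n + (1.0 - beta) * (d_p - g_o)
           for g_n, g_o, d_p in zip(grad_new, grad_old, d_prev)]
    with torch.no_grad():
        for p, d in zip(params, d_t):
            p.add_(d, alpha=lr)  # gradient ascent on the expected return
    return d_t
```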
Related papers
- Trajectory-Oriented Policy Optimization with Sparse Rewards [2.9602904918952695]
We introduce an approach that leverages offline demonstration trajectories for faster and more efficient online RL in environments with sparse rewards.
Our pivotal insight involves treating offline demonstration trajectories as guidance, rather than mere imitation.
We then illustrate that this optimization problem can be streamlined into a policy-gradient algorithm, integrating rewards shaped by insights from offline demonstrations.
arXiv Detail & Related papers (2024-01-04T12:21:01Z)
- Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations [2.709826237514737]
The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning.
We propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG).
We show POSG's significant advantages in control performance and convergence speed in four sparse-reward environments.
arXiv Detail & Related papers (2023-12-30T07:41:45Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Continuous-Time Fitted Value Iteration for Robust Policies [93.25997466553929]
Solving the Hamilton-Jacobi-Bellman equation is important in many domains including control, robotics and economics.
We propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI).
These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems.
arXiv Detail & Related papers (2021-10-05T11:33:37Z)
- Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model [24.030426634281643]
In continuous control tasks, the widely used Gaussian policies result in ineffective exploration of the environment.
We propose a density-free off-policy algorithm, Generative Actor-Critic, using the push-forward model to increase the expressiveness of policies.
We show that push-forward policies possess desirable features, such as multi-modality, which can markedly improve exploration efficiency and algorithm performance.
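As a rough illustration of the push-forward idea described above, the sketch below maps an observation plus base noise through a network to an action, so the induced action distribution can be multi-modal without an explicit density; the noise dimension and architecture are assumptions, not the paper's model.

```python
# Minimal push-forward policy sketch: push simple Gaussian noise through a
# network conditioned on the observation. No explicit action density is kept
# (density-free), so multi-modal action distributions are possible.
# Architecture and noise dimension are assumptions for illustration only.
import torch
import torch.nn as nn


class PushForwardPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, noise_dim=8, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(obs_dim + noise_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs):
        z = torch.randn(obs.shape[0], self.noise_dim)   # base noise
        return self.net(torch.cat([obs, z], dim=-1))    # pushed-forward action
```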
arXiv Detail & Related papers (2021-05-08T16:29:20Z)
- Deep Reinforcement Learning for Haptic Shared Control in Unknown Tasks [1.0635248457021496]
Haptic shared control (HSC) is an alternative to direct teleoperation in teleoperated systems.
The application of virtual guiding forces decreases the user's control effort and improves execution time in various tasks.
The challenge lies in developing controllers to provide the optimal guiding forces for the tasks that are being performed.
This work addresses this challenge by designing a controller based on the deep deterministic policy gradient (DDPG) algorithm to provide the assistance, and a convolutional neural network (CNN) to perform task detection.
arXiv Detail & Related papers (2021-01-15T17:27:38Z)
- Demonstration-efficient Inverse Reinforcement Learning in Procedurally Generated Environments [137.86426963572214]
Inverse Reinforcement Learning can extrapolate reward functions from expert demonstrations.
We show that our approach, DE-AIRL, is demonstration-efficient and still able to extrapolate reward functions which generalize to the fully procedural domain.
arXiv Detail & Related papers (2020-12-04T11:18:02Z)
- Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function (a generic two-point estimator is sketched after this entry).
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
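The two-point evaluation mentioned in the ZDPG entry refers to a standard zeroth-order gradient estimate; the sketch below shows the generic estimator, with the perturbation site and scaling being assumptions rather than the exact ZDPG construction.

```python
# Generic two-point zeroth-order gradient estimate of a scalar objective J:
#   g ~ (d / (2 * mu)) * (J(theta + mu * u) - J(theta - mu * u)) * u,
# where u is a random unit direction and d the parameter dimension. ZDPG
# applies this idea to Q-function evaluations; details here are assumptions.
import numpy as np


def two_point_gradient(J, theta, mu=1e-2, rng=None):
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)                           # random unit direction
    delta = J(theta + mu * u) - J(theta - mu * u)    # two objective evaluations
    return (theta.size / (2.0 * mu)) * delta * u
```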