Learning from demonstrations with SACR2: Soft Actor-Critic with Reward
Relabeling
- URL: http://arxiv.org/abs/2110.14464v1
- Date: Wed, 27 Oct 2021 14:30:29 GMT
- Title: Learning from demonstrations with SACR2: Soft Actor-Critic with Reward
Relabeling
- Authors: Jesus Bujalance Martin, Raphaël Chekroun and Fabien Moutarde
- Abstract summary: Off-policy algorithms tend to be more sample-efficient, and can additionally benefit from any off-policy data stored in the replay buffer.
Expert demonstrations are a popular source for such data.
We present a new method, based on a reward bonus given to demonstrations and successful episodes.
- Score: 2.1485350418225244
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: During recent years, deep reinforcement learning (DRL) has made successful
incursions into complex decision-making applications such as robotics,
autonomous driving or video games. However, a well-known caveat of DRL
algorithms is their inefficiency, requiring huge amounts of data to converge.
Off-policy algorithms tend to be more sample-efficient, and can additionally
benefit from any off-policy data stored in the replay buffer. Expert
demonstrations are a popular source for such data: the agent is exposed to
successful states and actions early on, which can accelerate the learning
process and improve performance. In the past, multiple ideas have been proposed
to make good use of the demonstrations in the buffer, such as pretraining on
demonstrations only or minimizing additional cost functions. We carry out a
study to evaluate several of these ideas in isolation, to see which of them
have the most significant impact. We also present a new method, based on a
reward bonus given to demonstrations and successful episodes. First, we give a
reward bonus to the transitions coming from demonstrations to encourage the
agent to match the demonstrated behaviour. Then, upon collecting a successful
episode, we relabel its transitions with the same bonus before adding them to
the replay buffer, encouraging the agent to also match its previous successes.
The base algorithm for our experiments is the popular Soft Actor-Critic (SAC),
a state-of-the-art off-policy algorithm for continuous action spaces. Our
experiments focus on robotics, specifically on a reaching task for a robotic
arm in simulation. We show that our method, SACR2, based on reward relabeling,
improves the performance on this task, even in the absence of demonstrations.
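
The two relabeling steps described in the abstract can be sketched as a small replay buffer. This is a minimal illustration, not the authors' implementation: the names `Transition` and `RelabelingReplayBuffer`, the choice to add the bonus to the environment reward (rather than overwrite it), the bonus value, and the way success is signalled by the caller are all assumptions that the abstract does not specify.

```python
import random
from collections import deque
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    """One environment step (hypothetical container, not from the paper)."""
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    reward: float
    next_state: Tuple[float, ...]
    done: bool


class RelabelingReplayBuffer:
    """Replay buffer that adds a reward bonus to selected episodes.

    Assumption: the bonus is added on top of the environment reward.
    """

    def __init__(self, capacity: int, bonus: float) -> None:
        self.storage: deque = deque(maxlen=capacity)  # oldest entries drop out
        self.bonus = bonus

    def _add(self, episode: Sequence[Transition], with_bonus: bool) -> None:
        for t in episode:
            reward = t.reward + self.bonus if with_bonus else t.reward
            self.storage.append(replace(t, reward=reward))

    def add_demonstration(self, episode: Sequence[Transition]) -> None:
        # Step 1: demonstration transitions always receive the bonus.
        self._add(episode, with_bonus=True)

    def add_episode(self, episode: Sequence[Transition], success: bool) -> None:
        # Step 2: an agent episode is relabeled with the same bonus only
        # if the caller reports it as successful.
        self._add(episode, with_bonus=success)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling; an off-policy learner such as SAC would
        # consume these batches unchanged.
        pool = list(self.storage)
        return random.sample(pool, min(batch_size, len(pool)))


if __name__ == "__main__":
    buffer = RelabelingReplayBuffer(capacity=1000, bonus=1.0)
    episode = [
        Transition((0.0,), (0.1,), 0.0, (0.1,), False),
        Transition((0.1,), (0.1,), 0.0, (0.2,), True),
    ]
    buffer.add_demonstration(episode)           # stored with bonus
    buffer.add_episode(episode, success=False)  # stored unchanged
    buffer.add_episode(episode, success=True)   # relabeled with bonus
    print([t.reward for t in buffer.storage])
```

Because the relabeling happens when data enters the buffer, the learner's update rule stays untouched; in this sketch, running without demonstrations simply means never calling `add_demonstration`.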
Related papers
- Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions.
By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories.
arXiv Detail & Related papers (2024-11-19T01:23:52Z)
- Latent Action Priors From a Single Gait Cycle Demonstration for Online Imitation Learning [42.642008092347986]
We propose an additional inductive bias for robot learning: latent actions learned from expert demonstration as priors in the action space.
We show that these action priors can be learned from only a single open-loop gait cycle using a simple autoencoder.
arXiv Detail & Related papers (2024-10-04T09:10:56Z)
- Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control [9.118706387430883]
Reinforcement learning (RL) has recently achieved great success in various domains.
Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour.
We propose to use model predictive control (MPC) as an experience source for training RL agents in sparse reward environments.
arXiv Detail & Related papers (2022-10-04T11:06:38Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Self-Imitation Learning from Demonstrations [4.907551775445731]
Self-Imitation Learning exploits the agent's past good experience to learn from suboptimal demonstrations.
We show that SILfD can learn from demonstrations that are noisy or far from optimal.
We also find SILfD superior to the existing state-of-the-art LfD algorithms in sparse environments.
arXiv Detail & Related papers (2022-03-21T11:56:56Z)
- Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks [2.0305676256390934]
We present a new method to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm.
Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation.
Our experiments focus on manipulation robotics, specifically on three tasks for a 6 degrees-of-freedom robotic arm in simulation.
arXiv Detail & Related papers (2022-01-11T08:35:18Z)
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z)
- A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Forgetful Experience Replay in Hierarchical Reinforcement Learning from Demonstrations [55.41644538483948]
In this paper, we propose a combination of approaches that allow the agent to use low-quality demonstrations in complex vision-based environments.
Our proposed goal-oriented structuring of replay buffer allows the agent to automatically highlight sub-goals for solving complex hierarchical tasks in demonstrations.
The solution based on our algorithm beats all the solutions for the famous MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
arXiv Detail & Related papers (2020-06-17T15:38:40Z)
- Dynamic Experience Replay [6.062589413216726]
We build upon Ape-X DDPG and demonstrate our approach on robotic tight-fitting joint assembly tasks.
In particular, we run experiments on two different tasks: peg-in-hole and lap-joint.
Our ablation studies show that Dynamic Experience Replay is a crucial ingredient that largely shortens the training time in these challenging environments.
arXiv Detail & Related papers (2020-03-04T23:46:45Z)