Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
- URL: http://arxiv.org/abs/2009.14108v2
- Date: Tue, 28 Jun 2022 18:39:18 GMT
- Title: Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
- Authors: Vihang P. Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias
Dorfer, Patrick M. Blies, Johannes Brandstetter, Jose A. Arjona-Medina, Sepp
Hochreiter
- Abstract summary: We introduce Align-RUDDER, which employs reward redistribution effectively and drastically improves learning on few demonstrations.
On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.
- Score: 6.396567712417841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning algorithms require many samples when solving complex
hierarchical tasks with sparse and delayed rewards. For such complex tasks, the
recently proposed RUDDER uses reward redistribution to leverage steps in the
Q-function that are associated with accomplishing sub-tasks. However, often
only a few episodes with high rewards are available as demonstrations, since
current exploration strategies cannot discover them in reasonable time. In this
work, we introduce Align-RUDDER, which utilizes a profile model for reward
redistribution that is obtained from multiple sequence alignment of
demonstrations. Consequently, Align-RUDDER employs reward redistribution
effectively and, thereby, drastically improves learning on few demonstrations.
Align-RUDDER outperforms competitors on complex artificial tasks with delayed
rewards and few demonstrations. On the Minecraft ObtainDiamond task,
Align-RUDDER is able to mine a diamond, though not frequently. Code is
available at https://github.com/ml-jku/align-rudder. YouTube:
https://youtu.be/HO-_8ZUl-UY
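The mechanism described in the abstract can be illustrated with a deliberately simplified sketch: build a profile from demonstrations, score how well a new episode's prefix matches the profile, and hand out the episode return in proportion to how much each step increases that score. Everything below (the event strings, the frequency-based profile, the scoring function) is a hypothetical stand-in for the paper's clustering and multiple-sequence-alignment pipeline, not the released implementation at https://github.com/ml-jku/align-rudder.

```python
# Minimal, hypothetical sketch of profile-based reward redistribution in the
# spirit of Align-RUDDER. The real method clusters states into events and
# builds the profile with multiple sequence alignment; here the demonstrations
# are assumed to already be comparable, equally long event sequences.
from collections import Counter
from typing import List, Sequence

def build_profile(demos: List[Sequence[str]]) -> List[Counter]:
    """Position-wise event frequencies over the demonstrations."""
    length = min(len(d) for d in demos)
    return [Counter(d[t] for d in demos) for t in range(length)]

def alignment_score(events: Sequence[str], profile: List[Counter]) -> float:
    """Score a (prefix of an) episode against the profile: relative frequency
    of the observed event at each profile position, summed over positions."""
    score = 0.0
    for t, event in enumerate(events[: len(profile)]):
        score += profile[t][event] / sum(profile[t].values())
    return score

def redistribute_reward(episode: Sequence[str], episode_return: float,
                        profile: List[Counter]) -> List[float]:
    """Per-step reward proportional to the increase in alignment score of the
    episode prefix, rescaled so the step rewards sum to the original return."""
    scores = [alignment_score(episode[: t + 1], profile)
              for t in range(len(episode))]
    deltas = [scores[0]] + [scores[t] - scores[t - 1]
                            for t in range(1, len(episode))]
    total = sum(deltas) or 1.0
    return [episode_return * d / total for d in deltas]

# Toy usage with Minecraft-flavoured events.
demos = [["wood", "stone", "iron", "diamond"],
         ["wood", "stone", "iron", "diamond"],
         ["wood", "wood", "iron", "diamond"]]
profile = build_profile(demos)
print(redistribute_reward(["wood", "stone", "iron", "diamond"], 1.0, profile))
```

An agent would then be trained on these per-step rewards instead of the single delayed reward at the end of the episode.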
Related papers
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk, a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks [26.730889757506915]
We propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks.
By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and, if available, demonstrations.
Experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks.
arXiv Detail & Related papers (2024-04-25T17:28:33Z)
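As a rough illustration of how stage structure (as in the DrS entry above) can densify a sparse reward, the sketch below adds a bounded within-stage progress term to the index of the last completed stage. The stage_of and progress functions are hypothetical placeholders, and the formula is generic stage-based shaping, not the reward that DrS actually learns from sparse rewards and demonstrations.

```python
# Hypothetical stage-based dense reward: the stage index dominates, and a
# bounded progress term orders states within the current stage, so finishing
# a stage always outweighs any amount of within-stage progress.
import math

def stage_of(state: dict) -> int:
    """Placeholder: number of completed stages (e.g. grasped, then lifted)."""
    return int(state.get("grasped", False)) + int(state.get("lifted", False))

def progress(state: dict) -> float:
    """Placeholder: unbounded progress signal within the current stage,
    e.g. negative distance to the next sub-goal."""
    return -state.get("dist_to_next_subgoal", 1.0)

def dense_reward(state: dict) -> float:
    return stage_of(state) + math.tanh(progress(state))

print(dense_reward({"grasped": True, "dist_to_next_subgoal": 0.2}))   # ~0.80
print(dense_reward({"grasped": True, "lifted": True,
                    "dist_to_next_subgoal": 0.05}))                   # ~1.95
```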
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
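The smoothing idea in the DreamSmooth entry above is easy to illustrate: instead of training the world model's reward head on the raw, spiky reward sequence, train it on a temporally smoothed version. The Gaussian kernel and its width below are arbitrary choices for this sketch, not the paper's configuration.

```python
# Temporal reward smoothing: replace each reward label with a weighted average
# of nearby rewards, spreading a single sparse spike over neighbouring steps.
import numpy as np

def smooth_rewards(rewards: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    t = np.arange(len(rewards))
    weights = np.exp(-0.5 * ((t[:, None] - t[None, :]) / sigma) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)  # normalise per timestep
    return weights @ rewards

# A sparse episode: reward only at the final step.
rewards = np.zeros(50)
rewards[-1] = 1.0
print(smooth_rewards(rewards)[-5:])  # the spike now covers the last few steps
```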
- Video Prediction Models as Rewards for Reinforcement Learning [127.53893027811027]
VIPER is an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning.
We see our work as a starting point for scalable reward specification from unlabeled videos.
arXiv Detail & Related papers (2023-05-23T17:59:33Z)
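The reward signal described in the VIPER entry above can be sketched as the log-likelihood a pretrained video model assigns to the agent's next observation given the frames seen so far. The VideoModel interface and the dummy model below are assumptions made for this sketch; the released VIPER code and its exact likelihood term may differ.

```python
# Hypothetical sketch: per-step reward = log-likelihood of the next frame
# under a video prediction model pretrained on expert videos.
from typing import List, Protocol
import numpy as np

class VideoModel(Protocol):
    def log_prob(self, context: List[np.ndarray], next_frame: np.ndarray) -> float:
        """Log-likelihood of next_frame given the preceding frames."""
        ...

def video_model_rewards(model: VideoModel, frames: List[np.ndarray]) -> List[float]:
    """Trajectories that look like the expert training videos score high."""
    return [model.log_prob(frames[: t + 1], frames[t + 1])
            for t in range(len(frames) - 1)]

class DummyModel:
    """Stand-in that scores a frame by its similarity to the previous frame."""
    def log_prob(self, context, next_frame):
        return -float(np.mean((next_frame - context[-1]) ** 2))

frames = [np.full((8, 8), float(t)) for t in range(5)]
print(video_model_rewards(DummyModel(), frames))  # [-1.0, -1.0, -1.0, -1.0]
```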
- Unified Demonstration Retriever for In-Context Learning [56.06473069923567]
Unified Demonstration Retriever (UDR) is a single model to retrieve demonstrations for a wide range of tasks.
We propose a multi-task list-wise ranking training framework, with an iterative mining strategy to find high-quality candidates.
Experiments on 30+ tasks across 13 task families and multiple data domains show that UDR significantly outperforms baselines.
arXiv Detail & Related papers (2023-05-07T16:07:11Z)
- Reinforcement learning with Demonstrations from Mismatched Task under Sparse Reward [7.51772160511614]
Reinforcement learning often suffers from the sparse-reward issue in real-world robotics problems.
Prior works often assume that the learning agent and the expert aim to accomplish the same task, which requires collecting new data for every new task.
In this paper, we consider the case where the target task is mismatched from, but similar to, that of the expert.
Existing LfD methods cannot effectively guide learning in mismatched new tasks with sparse rewards.
arXiv Detail & Related papers (2022-12-03T02:24:59Z)
- Agent-Time Attention for Sparse Rewards Multi-Agent Reinforcement Learning [36.93626032028901]
Sparse and delayed rewards pose a challenge to single agent reinforcement learning.
We propose Agent-Time Attention (ATA), a neural network model with auxiliary losses for redistributing sparse and delayed rewards.
arXiv Detail & Related papers (2022-10-31T17:54:51Z)
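A common way to operationalise "redistributing sparse and delayed rewards", as in the Agent-Time Attention entry above, is return decomposition: a model predicts a per-timestep reward, and an auxiliary loss forces the predictions to sum to the observed episode return. The linear model and one-hot event features below are a generic, hypothetical illustration of that auxiliary loss, not the ATA architecture from the paper.

```python
# Generic return-decomposition sketch: a linear model predicts per-step
# rewards from one-hot event features and is trained so that the predicted
# rewards sum to the delayed episode return. Credit ends up on the steps
# that actually produced the return.
import numpy as np

rng = np.random.default_rng(0)
T, D = 20, 4                      # timesteps per episode, number of event types
w = np.zeros(D)                   # linear redistribution model

def episode():
    events = rng.integers(0, D, size=T)
    feats = np.eye(D)[events]                 # one-hot event features per step
    ret = float((events == D - 1).sum())      # delayed return: count of "key" events
    return feats, ret

for _ in range(3000):
    feats, ret = episode()
    err = (feats @ w).sum() - ret             # auxiliary "sum equals return" loss
    w -= 0.01 * err * feats.sum(axis=0)       # SGD step on 0.5 * err ** 2

feats, ret = episode()
print("return:", ret)
print("redistributed per-step rewards:", np.round(feats @ w, 2))
```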
- Context-Hierarchy Inverse Reinforcement Learning [30.71220625227959]
An inverse reinforcement learning (IRL) agent learns to act intelligently by observing expert demonstrations and learning the expert's underlying reward function.
We present Context Hierarchy IRL (CHIRL), a new IRL algorithm that exploits the context to scale up IRL and learn reward functions of complex behaviors.
Experiments on benchmark tasks, including a large-scale autonomous driving task in the CARLA simulator, show promising results in scaling up IRL for tasks with complex reward functions.
arXiv Detail & Related papers (2022-02-25T10:29:05Z)
- Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks [2.0305676256390934]
We present a new method to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm.
Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation.
Our experiments focus on manipulation robotics, specifically on three tasks for a 6 degrees-of-freedom robotic arm in simulation.
arXiv Detail & Related papers (2022-01-11T08:35:18Z)
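The bonus mechanism in the reward-relabelling entry above can be shown compactly: transitions that come from demonstrations, or from online episodes that ended in success, are stored with a small additive reward bonus before being handed to any off-policy learner. The transition format, the success test, and the bonus value are illustrative choices for this sketch, not the paper's exact settings.

```python
# Illustrative reward relabelling: add a bonus to transitions from
# demonstrations and from successful online episodes before they enter the
# replay buffer of an off-policy algorithm (e.g. SAC or TD3).
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Transition:
    state: tuple
    action: tuple
    reward: float
    next_state: tuple
    done: bool

def relabel(episode: List[Transition], is_demo: bool,
            bonus: float = 1.0) -> List[Transition]:
    """Add the bonus if the episode is a demonstration or ended successfully
    (here, success is approximated by a positive final reward)."""
    successful = is_demo or (bool(episode) and episode[-1].reward > 0)
    if not successful:
        return episode
    return [replace(t, reward=t.reward + bonus) for t in episode]

demo = [Transition((0.0,), (0.1,), 0.0, (0.1,), False),
        Transition((0.1,), (0.2,), 1.0, (0.3,), True)]
print([t.reward for t in relabel(demo, is_demo=True)])  # [1.0, 2.0]
```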
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Forgetful Experience Replay in Hierarchical Reinforcement Learning from Demonstrations [55.41644538483948]
In this paper, we propose a combination of approaches that allows the agent to use low-quality demonstrations in complex vision-based environments.
Our proposed goal-oriented structuring of the replay buffer allows the agent to automatically highlight sub-goals for solving complex hierarchical tasks in demonstrations.
The solution based on our algorithm beats all the solutions from the famous MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
arXiv Detail & Related papers (2020-06-17T15:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.