D-Shape: Demonstration-Shaped Reinforcement Learning via Goal
Conditioning
- URL: http://arxiv.org/abs/2210.14428v1
- Date: Wed, 26 Oct 2022 02:28:32 GMT
- Title: D-Shape: Demonstration-Shaped Reinforcement Learning via Goal
Conditioning
- Authors: Caroline Wang, Garrett Warnell, Peter Stone
- Abstract summary: D-Shape is a new method for combining imitation learning (IL) and reinforcement learning (RL) that uses ideas from reward shaping and goal-conditioned RL to resolve the conflict between demonstration matching and return maximization that arises when demonstrations are suboptimal.
We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy.
- Score: 48.57484755946714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While combining imitation learning (IL) and reinforcement learning (RL) is a
promising way to address poor sample efficiency in autonomous behavior
acquisition, methods that do so typically assume that the requisite behavior
demonstrations are provided by an expert that behaves optimally with respect to
a task reward. If, however, suboptimal demonstrations are provided, a
fundamental challenge appears in that the demonstration-matching objective of
IL conflicts with the return-maximization objective of RL. This paper
introduces D-Shape, a new method for combining IL and RL that uses ideas from
reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape
allows learning from suboptimal demonstrations while retaining the ability to
find the optimal policy with respect to the task reward. We experimentally
validate D-Shape in sparse-reward gridworld domains, showing that it both
improves over RL in terms of sample efficiency and converges consistently to
the optimal policy in the presence of suboptimal demonstrations.
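As a rough illustration of how reward shaping and goal conditioning can be combined in a gridworld, the sketch below treats a demonstration state as the goal and adds a potential-based shaping term to the task reward. It is not taken from the paper: the potential function (negative Manhattan distance), the choice of the next demonstration state as the goal, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

State = Tuple[int, int]  # (row, col) cell in a gridworld


def potential(state: State, goal: State) -> float:
    """Hypothetical potential: negative Manhattan distance to the goal."""
    return -float(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))


def shaped_reward(task_reward: float, state: State, next_state: State,
                  goal: State, gamma: float = 0.99) -> float:
    """Task reward plus the potential-based term gamma*phi(s', g) - phi(s, g)."""
    return task_reward + gamma * potential(next_state, goal) - potential(state, goal)


def demo_goal(demo: List[State], t: int) -> State:
    """Assumed goal choice: the demonstration state one step ahead of time t,
    clamped to the final demonstration state."""
    return demo[min(t + 1, len(demo) - 1)]


if __name__ == "__main__":
    # A possibly suboptimal demonstration that walks along the top row.
    demo = [(0, 0), (0, 1), (0, 2)]
    goal = demo_goal(demo, t=0)  # (0, 1)
    # Stepping onto the demonstrated next state earns a positive bonus,
    # even though the sparse task reward is still zero.
    print(shaped_reward(0.0, (0, 0), (0, 1), goal, gamma=1.0))
```

Because the added term is a difference of potentials, it telescopes along any trajectory when the goal is held fixed, which is why potential-based shaping is generally understood to leave the optimal policy for the task reward unchanged.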
Related papers
- Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment [0.618727087412292]
The alignment of large language models (LLMs) is crucial for generating helpful and harmless content.
Existing approaches leverage preference-based human feedback data to learn the reward function.
We propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Learning (AVRIL).
arXiv Detail & Related papers (2024-11-14T10:37:34Z)
- Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment [62.05713042908654]
We introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges.
We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals.
Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD.
arXiv Detail & Related papers (2024-05-24T15:13:53Z)
- REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning [7.889696505137217]
We propose Revealing Evolutionary Action Consequence Trajectories (REACT) to enhance the interpretability of Reinforcement Learning (RL).
In contrast to the prevalent practice of validating RL models based on their optimal behavior learned during training, we posit that considering a range of edge-case trajectories provides a more comprehensive understanding of their inherent behavior.
Our results highlight its effectiveness in revealing nuanced aspects of RL models' behavior beyond optimal performance, thereby contributing to improved interpretability.
arXiv Detail & Related papers (2024-04-04T10:56:30Z)
- Inverse Reinforcement Learning by Estimating Expertise of Demonstrators [18.50354748863624]
IRLEED, Inverse Reinforcement Learning by Estimating Expertise of Demonstrators, is a novel framework that overcomes hurdles without prior knowledge of demonstrator expertise.
IRLEED enhances existing Inverse Reinforcement Learning (IRL) algorithms by incorporating a general model of demonstrator suboptimality to address reward bias and action variance.
Experiments in both online and offline IL settings, with simulated and human-generated data, demonstrate IRLEED's adaptability and effectiveness.
arXiv Detail & Related papers (2024-02-02T20:21:09Z)
- Trajectory-Oriented Policy Optimization with Sparse Rewards [2.9602904918952695]
We introduce an approach leveraging offline demonstration trajectories for swifter and more efficient online RL in environments with sparse rewards.
Our pivotal insight involves treating offline demonstration trajectories as guidance, rather than mere imitation.
We then illustrate that this optimization problem can be streamlined into a policy-gradient algorithm, integrating rewards shaped by insights from offline demonstrations.
arXiv Detail & Related papers (2024-01-04T12:21:01Z)
- Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Imitation with Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
- Residual Reinforcement Learning from Demonstrations [51.56457466788513]
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.
We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations.
Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning.
arXiv Detail & Related papers (2021-06-15T11:16:49Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations [78.94386823185724]
Imitation learning learns effectively in sparse-rewarded tasks by leveraging the existing expert demonstrations.
In practice, collecting a sufficient amount of expert demonstrations can be prohibitively expensive.
We propose Self-Adaptive Learning (SAIL) that can achieve (near) optimal performance given only a limited number of sub-optimal demonstrations.
arXiv Detail & Related papers (2020-04-01T15:57:15Z)