Robust Asymmetric Learning in POMDPs
- URL: http://arxiv.org/abs/2012.15566v2
- Date: Fri, 19 Mar 2021 10:57:52 GMT
- Title: Robust Asymmetric Learning in POMDPs
- Authors: Andrew Warrington and J. Wilder Lavington and Adam Scibior and Mark Schmidt and Frank Wood
- Abstract summary: Existing approaches for imitation learning have a serious flaw: the expert does not know what the trainee cannot see.
We derive an objective to train the expert to maximize the expected reward of the imitating agent policy, and use it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D).
We show that A2D produces an expert policy that the agent can safely imitate, in turn outperforming policies learned by imitating a fixed expert.
- Score: 24.45409442047289
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Policies for partially observed Markov decision processes can be efficiently
learned by imitating policies for the corresponding fully observed Markov
decision processes. Unfortunately, existing approaches for this kind of
imitation learning have a serious flaw: the expert does not know what the
trainee cannot see, and so may encourage actions that are sub-optimal, even
unsafe, under partial information. We derive an objective to instead train the
expert to maximize the expected reward of the imitating agent policy, and use
it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D), that
jointly trains the expert and the agent. We show that A2D produces an expert
policy that the agent can safely imitate, in turn outperforming policies
learned by imitating a fixed expert.
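The abstract describes an alternating scheme: the expert, which sees the full state, is updated by reinforcement learning toward the return achieved while the partially observing agent influences behaviour, and the agent is then updated DAgger-style to imitate the expert on the states actually visited. Below is a minimal, hypothetical sketch of that structure, not the authors' implementation: the `collect_rollouts` helper, the beta-mixture data collection, the discrete action space, and the plain REINFORCE surrogate used for the expert step are all assumptions standing in for the objective derived in the paper.

```python
# Illustrative sketch only; names and signatures here are assumptions, not the
# authors' code. Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CategoricalPolicy(nn.Module):
    """Small MLP policy producing a categorical distribution over actions."""

    def __init__(self, in_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def dist(self, x: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(x))


def a2d_style_iteration(expert, agent, expert_opt, agent_opt, collect_rollouts, beta):
    """One alternating update in the spirit of the abstract.

    `collect_rollouts(expert, agent, beta)` is assumed to return a dict with
    full states, partial observations, the (integer) actions taken, and
    discounted returns, gathered under a beta-mixture of expert and agent.
    """
    batch = collect_rollouts(expert, agent, beta)
    states, obs = batch["states"], batch["observations"]
    actions, returns = batch["actions"], batch["returns"]

    # 1) Expert update: a plain REINFORCE-style surrogate. Because the returns
    #    come from trajectories the partially observing agent helps generate,
    #    the expert is nudged toward behaviour the agent can actually realise.
    #    (The paper derives a more careful objective; this only shows the shape.)
    expert_opt.zero_grad()
    log_p = expert.dist(states).log_prob(actions)
    (-(log_p * (returns - returns.mean())).mean()).backward()
    expert_opt.step()

    # 2) Agent update: project the updated expert onto the space of partially
    #    observed policies by minimising a KL divergence on the visited states.
    agent_opt.zero_grad()
    with torch.no_grad():
        target = expert.dist(states).probs
    agent_logp = agent.dist(obs).logits  # Categorical stores normalised log-probs
    kl = F.kl_div(agent_logp, target, reduction="batchmean")
    kl.backward()
    agent_opt.step()
    return kl.item()
```

Setting beta to 1 and freezing the expert recovers ordinary asymmetric imitation of a fixed expert; the point of the alternating version is that the expert's update is scored by the agent's achievable return rather than by the fully observed optimum.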
Related papers
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.
We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over a potentially suboptimal human expert (a sketch of this intervention-as-reward idea appears after this list).
arXiv Detail & Related papers (2023-11-21T21:05:21Z)
- Causal Imitation Learning with Unobserved Confounders [82.22545916247269]
We study imitation learning when sensory inputs of the learner and the expert differ.
We show that imitation could still be feasible by exploiting quantitative knowledge of the expert trajectories.
arXiv Detail & Related papers (2022-08-12T13:29:53Z)
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning [7.020079427649125]
We show that learning distinguishable skills for tasks with non-unique optima can be essential for further improving learning efficiency and performance.
We propose a probabilistic mixture-of-experts (PMOE) for multimodal policies, together with a novel gradient estimator for the resulting non-differentiability problem.
arXiv Detail & Related papers (2021-04-19T08:21:56Z)
- Learn to Exceed: Stereo Inverse Reinforcement Learning with Concurrent Policy Optimization [1.0965065178451106]
We study the problem of obtaining a control policy that can mimic and then outperform expert demonstrations in Markov decision processes.
One relevant approach is inverse reinforcement learning (IRL), which focuses on inferring a reward function from expert demonstrations.
We propose a novel method that enables the learning agent to outperform the demonstrator via a new concurrent reward and action policy learning approach.
arXiv Detail & Related papers (2020-09-21T02:16:21Z)
- Off-Policy Adversarial Inverse Reinforcement Learning [0.0]
Adversarial Imitation Learning (AIL) is a class of algorithms in reinforcement learning (RL).
We propose an Off-Policy Adversarial Inverse Reinforcement Learning (Off-policy-AIRL) algorithm that is sample efficient and achieves good imitation performance.
arXiv Detail & Related papers (2020-05-03T16:51:40Z)
- Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts [22.87432549580184]
We formulate the problem as Bayesian Reinforcement Learning over latent Markov Decision Processes (MDPs).
We first obtain an ensemble of experts, one for each latent MDP, and fuse their advice to compute a baseline policy.
Next, we train a Bayesian residual policy to improve upon the ensemble's recommendation and learn to reduce uncertainty.
BRPO significantly improves the ensemble of experts and drastically outperforms existing adaptive RL methods.
arXiv Detail & Related papers (2020-02-07T23:10:05Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search (a sketch of this reward-conditioned idea appears after this list).
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
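Two of the entries above describe mechanisms concrete enough to sketch. First, the Reward-Conditioned Policies entry asks whether effective policies can be learned with supervised learning alone; one hypothetical reading is to treat the agent's own past (state, action, achieved return) triples as labelled data and train a policy conditioned on the return. The network, the conditioning scheme, and the discrete action space below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a reward-conditioned supervised update. Requires PyTorch.
import torch
import torch.nn as nn


class RewardConditionedPolicy(nn.Module):
    """Policy that maps (state, target_return) to action logits."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, target_return: torch.Tensor):
        return self.net(torch.cat([state, target_return.unsqueeze(-1)], dim=-1))


def supervised_update(policy, optimizer, states, actions, returns):
    """Treat collected (state, action, return) triples as labelled data: the
    action is the label and the achieved return is the conditioning variable.
    No demonstrations and no policy gradient are used in this step."""
    optimizer.zero_grad()
    logits = policy(states, returns)
    loss = nn.functional.cross_entropy(logits, actions)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At deployment one would condition on a high but achievable target return and act from the resulting logits; the entry above only claims that this kind of supervised policy search can be made principled, so treat the snippet as an illustration of the data flow rather than of the paper's exact scheme.

Second, the RLIF entry proposes using intervention signals themselves as rewards. One simple convention, assumed here rather than taken from the paper, is to assign a negative reward at steps where the user intervened and zero elsewhere, producing a signal any off-policy RL algorithm can consume in place of a task reward.

```python
# Hypothetical illustration of turning intervention events into rewards.
from typing import List, Tuple


def interventions_to_rewards(steps: List[Tuple[object, object, bool]]) -> List[float]:
    """Map a trajectory of (observation, action, intervened) tuples to rewards.

    `intervened` is True when the user took over at that step; those steps are
    penalised so the learned policy is pushed away from states that trigger
    interventions."""
    return [-1.0 if intervened else 0.0 for (_, _, intervened) in steps]
```

For example, `interventions_to_rewards([(obs, act, False), (obs, act, True)])` returns `[0.0, -1.0]`.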