Sequence Model Imitation Learning with Unobserved Contexts
- URL: http://arxiv.org/abs/2208.02225v1
- Date: Wed, 3 Aug 2022 17:27:44 GMT
- Title: Sequence Model Imitation Learning with Unobserved Contexts
- Authors: Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, Zhiwei Steven Wu
- Abstract summary: We consider imitation learning problems where the expert has access to a per-episode context hidden from the learner.
We show that on-policy approaches are able to use history to identify the context while off-policy approaches actually perform worse when given access to history.
- Score: 39.4969161422156
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We consider imitation learning problems where the expert has access to a
per-episode context that is hidden from the learner, both in the demonstrations
and at test-time. While the learner might not be able to accurately reproduce
expert behavior early on in an episode, by considering the entire history of
states and actions, they might be able to eventually identify the context and
act as the expert would. We prove that on-policy imitation learning algorithms
(with or without access to a queryable expert) are better equipped to handle
these sorts of asymptotically realizable problems than off-policy methods and
are able to avoid the latching behavior (naive repetition of past actions) that
plagues the latter. We conduct experiments in a toy bandit domain that show
that there exist sharp phase transitions of whether off-policy approaches are
able to match expert performance asymptotically, in contrast to the uniformly
good performance of on-policy approaches. We demonstrate that on several
continuous control tasks, on-policy approaches are able to use history to
identify the context while off-policy approaches actually perform worse when
given access to history.
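To make the latching phenomenon concrete, below is a minimal, self-contained sketch (not the paper's code or exact experimental setup) of a two-armed bandit with a hidden per-episode context: the expert always pulls the rewarding arm, and both learners condition on a (previous action, previous reward) history feature through a simple nearest-neighbour classifier. The domain, the 1-NN learner, and all function names are illustrative assumptions.

```python
# Illustrative sketch only: hidden-context bandit, history-conditioned learners.
import random

random.seed(0)
HORIZON = 10

def rollout(policy, context):
    """Run one episode; return (history-feature, expert-label) pairs and the return."""
    data, prev, total = [], None, 0          # prev = (previous action, previous reward)
    for t in range(HORIZON):
        feats = prev if prev is not None else (-1, -1)   # no history at t = 0
        action = policy(feats)
        reward = 1 if action == context else 0
        data.append((feats, context))                    # expert label is always the context
        total += reward
        prev = (action, reward)
    return data, total

def expert_policy_for(context):
    return lambda feats: context

def one_nn_policy(dataset):
    """1-nearest-neighbour classifier over (prev action, prev reward) features."""
    def policy(feats):
        if feats == (-1, -1) or not dataset:
            return random.randint(0, 1)                  # no information at t = 0
        best = min(dataset, key=lambda d: (d[0][0] - feats[0])**2 + (d[0][1] - feats[1])**2)
        return best[1]
    return policy

def evaluate(policy, episodes=500):
    return sum(rollout(policy, random.randint(0, 1))[1] for _ in range(episodes)) / episodes

# Off-policy (behaviour cloning): train only on expert rollouts. Expert histories
# never contain a reward of 0, so after a mistake the nearest training example
# says "repeat the previous action" -- the latching behaviour.
bc_data = []
for _ in range(200):
    c = random.randint(0, 1)
    bc_data += rollout(expert_policy_for(c), c)[0]
bc_policy = one_nn_policy(bc_data)

# On-policy (DAgger-style): roll out the current learner, label every visited
# history with the expert's action, aggregate, and refit. Mistakes (reward 0)
# now appear in training, so history identifies the context.
agg_data, learner = [], one_nn_policy([])
for _ in range(5):
    for _ in range(40):
        c = random.randint(0, 1)
        agg_data += rollout(learner, c)[0]
    learner = one_nn_policy(agg_data)

print("expert return      :", HORIZON)
print("BC with history    :", evaluate(bc_policy))   # stays near HORIZON / 2
print("DAgger with history:", evaluate(learner))     # near-expert: at most one mistake per episode
```

In this sketch the gap arises purely from the data distribution: the behaviour-cloning learner has never seen an unrewarded pull, so its nearest neighbour after a mistake is to repeat itself, while the interactively trained learner has seen corrected mistakes and switches arms after one unrewarded pull.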
Related papers
- MEGA-DAgger: Imitation Learning with Multiple Imperfect Experts [7.4506213369860195]
MEGA-DAgger is a new DAgger variant that is suitable for interactive learning with multiple imperfect experts.
We demonstrate that policy learned using MEGA-DAgger can outperform both experts and policies learned using the state-of-the-art interactive imitation learning algorithms.
arXiv Detail & Related papers (2023-03-01T16:40:54Z)
- Deconfounding Imitation Learning with Variational Inference [19.99248795957195]
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent.
This is because partial observability gives rise to hidden confounders in the causal graph.
We propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy.
arXiv Detail & Related papers (2022-11-04T18:00:02Z) - Causal Imitation Learning with Unobserved Confounders [82.22545916247269]
We study imitation learning when sensory inputs of the learner and the expert differ.
We show that imitation could still be feasible by exploiting quantitative knowledge of the expert trajectories.
arXiv Detail & Related papers (2022-08-12T13:29:53Z) - Online Learning with Off-Policy Feedback [18.861989132159945]
We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback.
We propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy.
arXiv Detail & Related papers (2022-07-18T21:57:16Z) - Chain of Thought Imitation with Procedure Cloning [129.62135987416164]
We propose procedure cloning, which applies supervised sequence prediction to imitate the series of expert computations.
We show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations.
arXiv Detail & Related papers (2022-05-22T13:14:09Z) - Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with
Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z) - Learning "What-if" Explanations for Sequential Decision-Making [92.8311073739295]
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior is essential.
We propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes.
We highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
arXiv Detail & Related papers (2020-07-02T14:24:17Z) - Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)