A Bayesian Solution To The Imitation Gap
- URL: http://arxiv.org/abs/2407.00495v1
- Date: Sat, 29 Jun 2024 17:13:37 GMT
- Title: A Bayesian Solution To The Imitation Gap
- Authors: Risto Vuorio, Mattie Fellows, Cong Lu, Clémence Grislain, Shimon Whiteson
- Abstract summary: An agent must learn to act in environments where no reward signal can be specified.
In some cases, differences in observability between the expert and the agent can give rise to an imitation gap.
- Score: 34.16107600758348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many real-world settings, an agent must learn to act in environments where no reward signal can be specified, but a set of expert demonstrations is available. Imitation learning (IL) is a popular framework for learning policies from such demonstrations. However, in some cases, differences in observability between the expert and the agent can give rise to an imitation gap such that the expert's policy is not optimal for the agent and a naive application of IL can fail catastrophically. In particular, if the expert observes the Markov state and the agent does not, then the expert will not demonstrate the information-gathering behavior needed by the agent but not the expert. In this paper, we propose a Bayesian solution to the Imitation Gap (BIG), first using the expert demonstrations, together with a prior specifying the cost of exploratory behavior that is not demonstrated, to infer a posterior over rewards with Bayesian inverse reinforcement learning (IRL). BIG then uses the reward posterior to learn a Bayes-optimal policy. Our experiments show that BIG, unlike IL, allows the agent to explore at test time when presented with an imitation gap, whilst still learning to behave optimally using expert demonstrations when no such gap exists.
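As a rough illustration of the two-stage recipe in the abstract, the sketch below (a toy tabular setup with hypothetical names throughout, not the authors' implementation) infers a posterior over a finite set of candidate rewards from demonstrations under a Boltzmann-rational expert model, then plans against the posterior-mean reward. A true Bayes-optimal policy would instead plan in the belief MDP, which is what lets BIG explore deliberately at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))          # toy transitions P[s, a] -> s'
candidate_rewards = [rng.normal(size=S) for _ in range(4)]  # prior support
log_post = np.log(np.full(len(candidate_rewards), 0.25))    # uniform prior
demos = [(0, 1), (2, 0), (3, 1)]                    # expert (state, action) pairs

def q_values(r, iters=200):
    q = np.zeros((S, A))
    for _ in range(iters):                          # value iteration
        q = r[:, None] + gamma * P @ q.max(axis=1)
    return q

def boltzmann(q, beta=5.0):
    """Soft-rational expert model: pi(a|s) proportional to exp(beta * Q)."""
    z = np.exp(beta * (q - q.max(axis=1, keepdims=True)))
    return z / z.sum(axis=1, keepdims=True)

# Stage 1: Bayesian IRL -- posterior over rewards given the demonstrations.
for k, r in enumerate(candidate_rewards):
    pi = boltzmann(q_values(r))
    log_post[k] += sum(np.log(pi[s, a]) for s, a in demos)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()

# Stage 2 (crude stand-in): plan against the posterior-mean reward. BIG
# instead learns a Bayes-optimal policy, which can explore to resolve its
# remaining uncertainty over rewards.
mean_r = sum(w * r for w, r in zip(posterior, candidate_rewards))
print("posterior:", np.round(posterior, 3))
print("greedy policy:", q_values(mean_r).argmax(axis=1))
```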
Related papers
- Offline Imitation Learning with Model-based Reverse Augmentation [48.64791438847236]
We propose a novel model-based framework, called offline Imitation Learning with Self-paced Reverse Augmentation.
Specifically, we build a reverse dynamic model from the offline demonstrations, which can efficiently generate trajectories leading to the expert-observed states.
A reinforcement learning method is then used to learn from the augmented trajectories, transitioning from expert-unobserved states to expert-observed states.
arXiv Detail & Related papers (2024-06-18T12:27:02Z)
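A rough illustration of the reverse-augmentation idea above (synthetic linear dynamics and a least-squares model; illustrative names, not the paper's actual method): fit a reverse dynamics model and roll it backwards from an expert-visited state, so the reversed result runs forward into expert support.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic offline data from linear dynamics: s' = A s + B a + noise.
A_true = np.array([[0.9, 0.1], [0.0, 0.95]])
B_true = np.array([[0.5], [0.2]])
S = rng.normal(size=(500, 2))                 # states
U = rng.normal(size=(500, 1))                 # actions
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(500, 2))

# Reverse dynamics model: predict s_t from (s_{t+1}, a_t) by least squares.
X = np.hstack([S_next, U])
W, *_ = np.linalg.lstsq(X, S, rcond=None)

# Roll backwards from an expert-observed state.
s = np.array([1.0, -0.5])                     # an expert-observed state
backward = [s]
for _ in range(5):
    a = rng.normal(size=1)                    # sampled action for augmentation
    s = np.concatenate([s, a]) @ W            # predicted predecessor state
    backward.append(s)
forward_traj = backward[::-1]                 # trajectory ends in expert support
```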
- Multi-Agent Imitation Learning: Value is Easy, Regret is Hard [52.31989962031179]
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
arXiv Detail & Related papers (2024-06-06T16:18:20Z)
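Loosely, and in approximate notation rather than the paper's exact definitions, the contrast is between an on-policy value gap and a regret gap that accounts for each agent $i$'s best unilateral deviation:

$$\text{value gap: } J(\pi^{E}) - J(\pi), \qquad \text{regret gap: } \max_{i}\,\max_{\pi_i'}\ J_i(\pi_i', \pi_{-i}) - J_i(\pi).$$

Driving the value gap to zero only matches the expert on the demonstrated distribution; driving the regret gap to zero additionally removes any strategic agent's incentive to deviate.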
- Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning [51.972577689963714]
Single-demonstration imitation learning (IL) is a practical approach for real-world applications where acquiring multiple expert demonstrations is costly or infeasible.
In contrast to typical IL settings, single-demonstration IL involves an agent having access to only one expert trajectory.
We highlight the issue of sparse reward signals in this setting and mitigate it with our proposed Transition Discriminator-based IL (TDIL) method.
arXiv Detail & Related papers (2024-02-01T23:06:19Z)
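A minimal sketch of the general surrogate-reward idea above (a logistic-regression discriminator on synthetic data; not TDIL's actual architecture or training scheme): train a classifier on transitions and use its log-odds as a dense reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake (s, s') transition pairs, flattened to feature vectors.
expert = rng.normal(loc=1.0, size=(200, 4))   # expert-like transitions
agent = rng.normal(loc=0.0, size=(200, 4))    # agent transitions
X = np.vstack([expert, agent])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Logistic-regression discriminator trained by gradient ascent.
w = np.zeros(4)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p) / len(y)

def surrogate_reward(s, s_next):
    """Dense surrogate reward: log-odds that the transition is expert-like."""
    return float(np.concatenate([s, s_next]) @ w)

print(surrogate_reward(np.ones(2), np.ones(2)))   # high for expert-like input
```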
- Inverse Reinforcement Learning with Sub-optimal Experts [56.553106680769474]
We study the theoretical properties of the class of reward functions that are compatible with a given set of experts.
Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards.
We analyze a uniform sampling algorithm that is minimax optimal whenever the sub-optimal experts' performance level is sufficiently close to that of the optimal agent.
arXiv Detail & Related papers (2024-01-08T12:39:25Z)
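In rough notation (an approximation of the paper's formulation, not a verbatim definition), each sub-optimal expert $\pi_j$ with suboptimality budget $\xi_j$ adds a constraint that intersects, and hence shrinks, the set of compatible rewards:

$$\mathcal{R} = \big\{\, r \;:\; V_r^{\pi^{*}} \ge V_r^{\pi}\ \forall \pi, \quad V_r^{\pi^{*}} - V_r^{\pi_j} \le \xi_j\ \forall j \,\big\}.$$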
- Unlabeled Imperfect Demonstrations in Adversarial Imitation Learning [48.595574101874575]
In the real world, expert demonstrations are more likely to be imperfect.
A positive-unlabeled adversarial imitation learning algorithm is developed.
The agent policy is optimized to fool the discriminator and produce trajectories similar to the optimal expert demonstrations.
arXiv Detail & Related papers (2023-02-13T11:26:44Z)
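A hedged sketch of the positive-unlabeled ingredient (a standard non-negative PU risk estimator with a sigmoid loss; the paper's exact objective may differ): a small set of high-quality demonstrations is treated as positive and the remaining imperfect demonstrations as unlabeled.

```python
import numpy as np

def nn_pu_risk(scores_pos, scores_unl, pi_p=0.3):
    """Non-negative PU risk with a sigmoid loss.

    scores_*: discriminator logits; pi_p: assumed positive-class prior,
    i.e. the guessed fraction of truly optimal demonstrations.
    """
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    loss_pos = np.mean(sig(-scores_pos))         # positives scored as positive
    loss_pos_as_neg = np.mean(sig(scores_pos))   # positives scored as negative
    loss_unl_as_neg = np.mean(sig(scores_unl))   # unlabeled scored as negative
    # The max(0, .) clamp keeps the implied negative-class risk non-negative.
    return pi_p * loss_pos + max(0.0, loss_unl_as_neg - pi_p * loss_pos_as_neg)

risk = nn_pu_risk(np.array([2.0, 1.5]), np.array([-0.5, 0.3, 1.2]))
```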
- Deconfounding Imitation Learning with Variational Inference [19.99248795957195]
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent.
This is because partial observability gives rise to hidden confounders in the causal graph.
We propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy.
arXiv Detail & Related papers (2022-11-04T18:00:02Z)
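A schematic PyTorch sketch of the deconfounding recipe above (dimensions and names are illustrative, not the paper's code): an inference network encodes a trajectory into a latent z approximating the expert's hidden information, and the policy conditions on it.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Inference network q(z | trajectory) with a reparameterized Gaussian."""
    def __init__(self, obs_dim=4, z_dim=2):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, 32, batch_first=True)
        self.mu = nn.Linear(32, z_dim)
        self.log_std = nn.Linear(32, z_dim)

    def forward(self, traj):                        # traj: (B, T, obs_dim)
        h = self.rnn(traj)[1][-1]                   # final hidden state (B, 32)
        mu, log_std = self.mu(h), self.log_std(h)
        z = mu + log_std.exp() * torch.randn_like(mu)   # reparameterization
        return z, mu, log_std

class LatentConditionalPolicy(nn.Module):
    def __init__(self, obs_dim=4, z_dim=2, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + z_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))    # action logits

encoder, policy = TrajectoryEncoder(), LatentConditionalPolicy()
traj = torch.randn(8, 10, 4)                        # batch of trajectories
z, mu, log_std = encoder(traj)
logits = policy(traj[:, -1], z)                     # act at the last observation
# Training (not shown) would combine an imitation loss with a KL term on z,
# as in a standard ELBO.
```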
- Safe Driving via Expert Guided Policy Optimization [38.68691065718655]
Expert-in-the-loop Reinforcement Learning is used to safeguard the exploration of the learning agent.
We develop a novel Expert Guided Policy Optimization (EGPO) method which integrates the guardian in the loop of reinforcement learning.
Our method achieves superior training and test-time safety, outperforms baselines by a substantial margin in sample efficiency, and preserves generalizability to unseen environments at test time.
arXiv Detail & Related papers (2021-10-13T16:19:03Z)
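A minimal sketch of the expert-in-the-loop safeguard described above (assuming hypothetical `expert_policy` and `is_unsafe` stubs and a classic Gym-style 4-tuple step; not EGPO's actual training loop):

```python
def guarded_step(env, state, learner_action, expert_policy, is_unsafe):
    """Let a guardian expert take over whenever the learner's action is unsafe."""
    if is_unsafe(state, learner_action):
        action, intervened = expert_policy(state), True    # expert takeover
    else:
        action, intervened = learner_action, False
    next_state, reward, done, info = env.step(action)
    # EGPO-style training additionally penalizes interventions, pushing the
    # learner out of the region where the guardian must take over.
    return next_state, reward, done, intervened
```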
- Blind Exploration and Exploitation of Stochastic Experts [7.106986689736826]
We present blind exploration and exploitation (BEE) algorithms for identifying the most reliable expert based on formulations that employ posterior sampling, upper-confidence bounds, empirical Kullback-Leibler divergence, and minmax methods for the multi-armed bandit problem.
We propose an empirically realizable measure of expert competence that can be computed instantaneously using only the opinions of other experts.
arXiv Detail & Related papers (2021-04-02T15:02:02Z)
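A toy sketch of the posterior-sampling variant mentioned above: treat each expert as a Bernoulli arm with a Beta posterior over its reliability and select by Thompson sampling. Note the real setting is blind, with competence estimated only from other experts' opinions; the ground-truth feedback below is a synthetic stand-in for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

true_reliability = np.array([0.6, 0.75, 0.9])   # synthetic, unknown to the agent
alpha = np.ones(3)                              # Beta posterior: successes + 1
beta = np.ones(3)                               # Beta posterior: failures + 1

for t in range(1000):
    k = np.argmax(rng.beta(alpha, beta))          # Thompson sample an expert
    correct = rng.random() < true_reliability[k]  # query the chosen expert
    alpha[k] += correct
    beta[k] += 1 - correct

print("posterior means:", alpha / (alpha + beta))  # best expert stands out
```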
- Co-Imitation Learning without Expert Demonstration [39.988945772085465]
We propose a novel learning framework called Co-Imitation Learning (CoIL) to exploit the past good experiences of the agents without expert demonstration.
While the experiences could be valuable or misleading, we propose to estimate the potential utility of each piece of experience with the expected gain of the value function.
Experimental results on various tasks demonstrate the significant superiority of the proposed Co-Imitation Learning framework.
arXiv Detail & Related papers (2021-03-27T06:58:40Z)
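A toy sketch of the experience-weighting idea above (a softmax over return-minus-value gains; the paper's actual utility estimator may differ): experiences whose observed returns beat the current value estimate are imitated more.

```python
import numpy as np

def experience_weights(returns, values, temperature=1.0):
    """Weight experiences by expected gain: observed return minus V(s)."""
    gain = np.asarray(returns, dtype=float) - np.asarray(values, dtype=float)
    w = np.exp(np.clip(gain / temperature, -50.0, 50.0))   # softmax weighting
    return w / w.sum()

print(experience_weights([1.0, 0.2, 0.8], [0.5, 0.5, 0.5]))
```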
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.