Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration
- URL: http://arxiv.org/abs/2506.20307v1
- Date: Wed, 25 Jun 2025 10:39:32 GMT
- Title: Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration
- Authors: Heyang Zhao, Xingrui Yu, David M. Bossens, Ivor W. Tsang, Quanquan Gu,
- Abstract summary: We propose a novel imitation learning algorithm called Imitation Learning with Double Exploration (ILDE)<n>ILDE implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty to potentially improve the convergence to the expert policy, and (2) curiosity-driven exploration of the states that deviate from the demonstration trajectories to potentially yield beyond-expert performance.<n> Empirically, we demonstrate that ILDE outperforms the state-of-the-art imitation learning algorithms in terms of sample efficiency and achieves beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations than in previous work.
- Score: 77.36486933055907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning is a central problem in reinforcement learning where the goal is to learn a policy that mimics the expert's behavior. In practice, it is often challenging to learn the expert policy from a limited number of demonstrations accurately due to the complexity of the state space. Moreover, it is essential to explore the environment and collect data to achieve beyond-expert performance. To overcome these challenges, we propose a novel imitation learning algorithm called Imitation Learning with Double Exploration (ILDE), which implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty to potentially improve the convergence to the expert policy, and (2) curiosity-driven exploration of the states that deviate from the demonstration trajectories to potentially yield beyond-expert performance. Empirically, we demonstrate that ILDE outperforms the state-of-the-art imitation learning algorithms in terms of sample efficiency and achieves beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations than in previous work. We also provide a theoretical justification of ILDE as an uncertainty-regularized policy optimization method with optimistic exploration, leading to a regret growing sublinearly in the number of episodes.
Related papers
- Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL)<n>We identify the Nash equilibrium from a preference-only offline dataset in general-sum games.<n>Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - Robust Visual Imitation Learning with Inverse Dynamics Representations [32.806294517277976]
We develop an inverse dynamics state representation learning objective to align the expert environment and the learning environment.
With the abstract state representation, we design an effective reward function, which thoroughly measures the similarity between behavior data and expert data.
Our approach can achieve a near-expert performance in most environments, and significantly outperforms the state-of-the-art visual IL methods and robust IL methods.
arXiv Detail & Related papers (2023-10-22T11:47:35Z) - Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerable fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z) - Planning for Sample Efficient Imitation Learning [52.44953015011569]
Current imitation algorithms struggle to achieve high performance and high in-environment sample efficiency simultaneously.
We propose EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously.
Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency.
arXiv Detail & Related papers (2022-10-18T05:19:26Z) - Deterministic and Discriminative Imitation (D2-Imitation): Revisiting
Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z) - Reinforced Imitation Learning by Free Energy Principle [2.9327503320877457]
Reinforcement Learning (RL) requires a large amount of exploration especially in sparse-reward settings.
Imitation Learning (IL) can learn from expert demonstrations without exploration.
We radically unify RL and IL based on Free Energy Principle (FEP)
arXiv Detail & Related papers (2021-07-25T14:19:29Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art locomotion in terms of both sample-efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Learn to Exceed: Stereo Inverse Reinforcement Learning with Concurrent
Policy Optimization [1.0965065178451106]
We study the problem of obtaining a control policy that can mimic and then outperform expert demonstrations in Markov decision processes.
One main relevant approach is the inverse reinforcement learning (IRL), which mainly focuses on inferring a reward function from expert demonstrations.
We propose a novel method that enables the learning agent to outperform the demonstrator via a new concurrent reward and action policy learning approach.
arXiv Detail & Related papers (2020-09-21T02:16:21Z) - Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with
Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z) - Policy Gradient from Demonstration and Curiosity [9.69620214666782]
In this work, an integrated policy gradient algorithm was proposed to boost exploration and facilitate intrinsic reward learning.
The presented algorithm was evaluated on a range of simulated tasks with sparse extrinsic reward signals.
It was found that the agent could imitate the expert's behavior and meanwhile sustain high return.
arXiv Detail & Related papers (2020-04-22T07:57:39Z) - Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations [78.94386823185724]
Imitation learning learns effectively in sparse-rewarded tasks by leveraging the existing expert demonstrations.
In practice, collecting a sufficient amount of expert demonstrations can be prohibitively expensive.
We propose Self-Adaptive Learning (SAIL) that can achieve (near) optimal performance given only a limited number of sub-optimal demonstrations.
arXiv Detail & Related papers (2020-04-01T15:57:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.