Energy-Based Imitation Learning
- URL: http://arxiv.org/abs/2004.09395v4
- Date: Thu, 15 Apr 2021 04:30:35 GMT
- Title: Energy-Based Imitation Learning
- Authors: Minghuan Liu, Tairan He, Minkai Xu, Weinan Zhang
- Abstract summary: We tackle a common scenario in imitation learning (IL) where agents try to recover the optimal policy from expert demonstrations.
Inspired by recent progress in energy-based models (EBMs), in this paper we propose a simplified IL framework named Energy-Based Imitation Learning (EBIL).
EBIL combines the ideas of EBMs and occupancy measure matching, and through theoretical analysis we reveal that EBIL and Max-Entropy IRL (MaxEnt IRL) are two sides of the same coin.
- Score: 29.55675131809474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle a common scenario in imitation learning (IL), where agents try to
recover the optimal policy from expert demonstrations without further access to
the expert or environment reward signals. Apart from simple Behavior Cloning
(BC), which adopts supervised learning but suffers from compounding error,
previous solutions such as inverse reinforcement learning (IRL) and recent
generative adversarial methods involve a bi-level or alternating optimization
for updating the reward function and the policy, and thus suffer from high
computational cost and training instability. Inspired by recent progress in
energy-based models (EBMs), in this paper we propose a simplified IL framework
named Energy-Based Imitation Learning (EBIL). Instead of updating the reward
and policy iteratively, EBIL breaks out of the traditional IRL paradigm with a
simple and flexible two-stage solution: first estimate the expert energy as a
surrogate reward function through score matching, then use that reward to
learn the policy with standard reinforcement learning algorithms. EBIL
combines the ideas of EBMs and occupancy measure matching, and through
theoretical analysis we reveal that EBIL and Max-Entropy IRL (MaxEnt IRL) are
two sides of the same coin, so EBIL can serve as an alternative to adversarial
IRL methods. Extensive qualitative and quantitative experiments indicate that
EBIL recovers meaningful and interpretable reward signals while achieving
performance comparable to existing algorithms on IL benchmarks.
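Below is a minimal sketch of the two-stage recipe described in the abstract, under stated assumptions: expert data is given as a tensor of states, the expert energy is fit with a denoising score matching objective (one common way to do score matching; the paper does not prescribe this exact estimator here), and the negative energy is exposed as a surrogate reward for any off-the-shelf RL algorithm, whose training loop is omitted. PyTorch is used for illustration, and all names (`EnergyNet`, `fit_expert_energy`, `surrogate_reward`, `sigma`) are hypothetical, not the authors' reference implementation.
```python
import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Scalar energy E_theta(s) of a state."""

    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)


def dsm_loss(energy: EnergyNet, states: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Denoising score matching: the model score -grad E_theta(s~) should match
    the score of the Gaussian-perturbed data, -(s~ - s) / sigma^2."""
    noise = torch.randn_like(states) * sigma
    noisy = (states + noise).requires_grad_(True)
    e = energy(noisy).sum()
    grad_e = torch.autograd.grad(e, noisy, create_graph=True)[0]
    target_score = -noise / sigma**2  # grad_{s~} log q(s~ | s)
    return ((-grad_e - target_score) ** 2).sum(dim=-1).mean()


def fit_expert_energy(expert_states: torch.Tensor, epochs: int = 100,
                      batch_size: int = 256, lr: float = 1e-3) -> EnergyNet:
    """Stage 1: estimate the expert energy from demonstrations via score matching."""
    energy = EnergyNet(expert_states.shape[-1])
    opt = torch.optim.Adam(energy.parameters(), lr=lr)
    for _ in range(epochs):
        idx = torch.randperm(expert_states.shape[0])
        for start in range(0, len(idx), batch_size):
            batch = expert_states[idx[start:start + batch_size]]
            loss = dsm_loss(energy, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return energy


def surrogate_reward(energy: EnergyNet, state: torch.Tensor) -> torch.Tensor:
    """Stage 2 plug-in: low expert energy maps to high reward for the RL learner."""
    with torch.no_grad():
        return -energy(state)
```
In use, the fitted `surrogate_reward` would replace the environment reward inside whatever RL algorithm one prefers, which is the second stage the abstract describes; nothing in the sketch ties it to a particular policy optimizer.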
Related papers
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying a Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - EvIL: Evolution Strategies for Generalisable Imitation Learning [33.745657379141676]
In imitation learning (IL), the environment in which expert demonstrations are collected and the environment in which we want to deploy the learned policy are often not exactly the same.
Compared to policy-centric approaches to IL such as behavioural cloning, reward-centric approaches such as inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments.
We find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in.
We propose a novel evolution-strategies based method EvIL to optimise for a reward-shaping term that speeds up re-training in the target environment.
arXiv Detail & Related papers (2024-06-15T22:46:39Z) - CLARE: Conservative Model-Based Reward Learning for Offline Inverse
Reinforcement Learning [26.05184273238923]
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL).
We devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy.
arXiv Detail & Related papers (2023-02-09T17:16:29Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
IMitation with PLANning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Deterministic and Discriminative Imitation (D2-Imitation): Revisiting
Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z) - Reinforcement Learning through Active Inference [62.997667081978825]
We show how ideas from active inference can augment traditional reinforcement learning approaches.
We develop and implement a novel objective for decision making, which we term the free energy of the expected future.
We demonstrate that the resulting algorithm successfully balances exploration and exploitation, while achieving robust performance on several challenging RL benchmarks with sparse, well-shaped, and no rewards.
arXiv Detail & Related papers (2020-02-28T10:28:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences of its use.