Imitation Learning by State-Only Distribution Matching
- URL: http://arxiv.org/abs/2202.04332v1
- Date: Wed, 9 Feb 2022 08:38:50 GMT
- Title: Imitation Learning by State-Only Distribution Matching
- Authors: Damian Boborzi, Christoph-Nikolas Straehle, Jens S. Buchner, Lars
Mikelsons
- Abstract summary: Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
- Score: 2.580765958706854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation Learning from observation describes policy learning in a similar
way to human learning. An agent's policy is trained by observing an expert
performing a task. While many state-only imitation learning approaches are
based on adversarial imitation learning, one main drawback is that adversarial
training is often unstable and lacks a reliable convergence estimator. If the
true environment reward is unknown and cannot be used to select the
best-performing model, this can result in bad real-world policy performance. We
propose a non-adversarial learning-from-observations approach, together with an
interpretable convergence and performance metric.
Our training objective minimizes the Kullback-Leibler divergence (KLD) between
the policy's and the expert's state-transition trajectories, which can be
optimized in a non-adversarial fashion. Guiding the optimization with learned
density models improves robustness. We further improve the sample
efficiency by rewriting the KLD minimization as the Soft Actor Critic objective
based on a modified reward using additional density models that estimate the
environment's forward and backward dynamics. Finally, we evaluate the
effectiveness of our approach on well-known continuous control environments and
show state-of-the-art performance while having a reliable performance estimator
compared to several recent learning-from-observation methods.
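Read at face value, the abstract describes replacing the environment reward with a density-ratio reward so that an off-the-shelf Soft Actor-Critic learner ends up minimizing the state-transition KLD. The sketch below is only that reading; the function names and the exact way the forward and backward dynamics models enter the reward are assumptions, not taken from the paper.
```python
def kld_matching_reward(s, s_next, log_q_expert, log_q_policy):
    """Hedged sketch of the 'modified reward' idea: reward a transition by how
    much more likely it is under the expert's state-transition density than
    under the policy-induced one. Maximizing this (plus SAC's entropy bonus)
    pushes the policy's transition distribution toward the expert's.

    log_q_expert(s, s_next): learned log-density of expert state transitions.
    log_q_policy(s, s_next): learned log-density of the policy's own transitions;
        per the abstract this is estimated with forward and backward dynamics
        models, a decomposition not shown here.
    """
    return log_q_expert(s, s_next) - log_q_policy(s, s_next)
```
In a SAC training loop, such a reward would simply replace the environment reward on transitions pushed into the replay buffer; the paper's convergence and performance metric is not modeled in this sketch.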
Related papers
- Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language model (LLM) unlearning via gradient ascent (GA).
Despite their simplicity and efficiency, GA-based methods are prone to excessive unlearning.
We propose several controlling methods that can regulate the extent of excessive unlearning.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
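For context on the gradient-ascent framing in the entry above, here is a generic, hedged sketch of one GA unlearning step with a crude control (gradient-norm clipping) on how far the model is pushed. The paper's specific controlling methods are not reproduced here, and `model`/`forget_batch` are placeholders for a causal LM and its batch format.
```python
import torch

def ga_unlearning_step(model, optimizer, forget_batch, max_grad_norm=1.0):
    """One gradient-ascent step on the forget data: negate the usual language-model
    loss so its gradient increases the loss on content to be unlearned. The norm
    clip is just one illustrative way to regulate the extent of unlearning."""
    optimizer.zero_grad()
    loss = model(**forget_batch).loss   # standard next-token loss on the forget batch
    (-loss).backward()                  # ascend instead of descend
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```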
- Model predictive control-based value estimation for efficient reinforcement learning [6.8237783245324035]
We design an improved reinforcement learning method based on model predictive control that models the environment through a data-driven approach.
Based on the learned environment model, it performs multi-step prediction to estimate the value function and optimize the policy.
The method demonstrates higher learning efficiency, faster convergence toward locally optimal strategies, and a smaller experience replay buffer requirement.
arXiv Detail & Related papers (2023-10-25T13:55:14Z)
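As a rough illustration of the multi-step prediction described in the entry above, the sketch below rolls a learned dynamics model forward for a few steps and bootstraps with a value function. All names (`policy`, `model`, `reward_fn`, `value_fn`) are placeholders, and the paper's actual estimator may differ.
```python
def multistep_value_estimate(s0, policy, model, reward_fn, value_fn,
                             horizon=5, gamma=0.99):
    """Roll the learned model forward `horizon` steps under the current policy,
    accumulate predicted rewards, and bootstrap the tail with the value function."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)              # action from the current policy
        r = reward_fn(s, a)        # predicted (or known) reward for this step
        s = model(s, a)            # learned dynamics model predicts the next state
        total += discount * r
        discount *= gamma
    return total + discount * value_fn(s)
```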
- Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction [12.94372063457462]
Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms.
It suffers from a critical drawback due to its reliance on learning from a single sample of the joint-action at a given state.
We propose an enhancement tool that accommodates any actor-critic MARL method.
arXiv Detail & Related papers (2022-09-02T13:44:00Z)
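One natural reading of the single-joint-action-sample problem in the entry above is to average the critic's bootstrap over several sampled joint actions. The sketch below shows only that generic idea and is not the paper's estimator; all argument names are placeholders.
```python
def averaged_td_target(reward, s_next, agent_policies, critic,
                       gamma=0.99, n_samples=8):
    """Form a TD target whose bootstrap term averages the centralized critic over
    several sampled joint actions at s_next, instead of a single joint sample."""
    values = []
    for _ in range(n_samples):
        joint_action = tuple(pi(s_next) for pi in agent_policies)  # one sampled joint action
        values.append(critic(s_next, joint_action))
    return reward + gamma * sum(values) / n_samples
```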
- IL-flOw: Imitation Learning from Observation using Normalizing Flows [28.998176144874193]
We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only.
Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods.
arXiv Detail & Related papers (2022-05-19T00:05:03Z)
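The "decoupled" design in the IL-flOw entry above suggests a two-stage recipe: fit a density model (for example a normalizing flow) to expert state transitions offline, then treat its log-likelihood as a fixed reward during reinforcement learning. The sketch below is that generic recipe, not the paper's exact reward; `flow_log_prob` and the clipping bounds are assumptions.
```python
def flow_reward(s, s_next, flow_log_prob, clip_min=-10.0, clip_max=10.0):
    """Reward a transition by its log-likelihood under a density model trained
    beforehand on expert (s, s_next) pairs. Clipping is an assumed safeguard
    against extreme log-densities, not something taken from the paper."""
    return max(clip_min, min(clip_max, flow_log_prob(s, s_next)))
```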
- Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
- Robust Learning from Observation with Model Misspecification [33.92371002674386]
Imitation learning (IL) is a popular paradigm for training policies in robotic systems.
We propose a robust IL algorithm to learn policies that can effectively transfer to the real environment without fine-tuning.
arXiv Detail & Related papers (2022-02-12T07:04:06Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
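For the "approximate upper confidence bound of the critics" in the entry above, one common construction with twin critics is mean plus a disagreement term. The sketch below shows that construction only; `beta` is an assumed optimism coefficient, and the DICE correction ratio mentioned in the entry is not modeled here.
```python
def critic_ucb(q1, q2, beta=1.0):
    """Approximate upper confidence bound from two critic estimates for the same
    (state, action): use their mean as the value estimate and their disagreement
    as a stand-in for epistemic uncertainty. An exploration policy can then be
    trained to maximize this optimistic value."""
    mean = 0.5 * (q1 + q2)
    spread = 0.5 * abs(q1 - q2)
    return mean + beta * spread
```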
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
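The Gaussian-channel view in the last entry is what makes empowerment tractable: for a channel z = A u + eps with Gaussian input and noise, the mutual information has a closed form, and empowerment is its maximum over the input distribution. The sketch below computes that closed form; `A`, `cov_u`, and `cov_noise` stand in for a trainable parameterization that the summary does not spell out.
```python
import numpy as np

def gaussian_channel_mi(A, cov_u, cov_noise):
    """Mutual information I(u; z) for z = A u + eps, with u ~ N(0, cov_u) and
    eps ~ N(0, cov_noise): 0.5 * log det(I + cov_noise^{-1} A cov_u A^T).
    Empowerment corresponds to maximizing this over the input distribution."""
    k = A.shape[0]
    m = np.eye(k) + np.linalg.solve(cov_noise, A @ cov_u @ A.T)
    _, logdet = np.linalg.slogdet(m)
    return 0.5 * logdet
```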
This list is automatically generated from the titles and abstracts of the papers on this site.