Mimicking Better by Matching the Approximate Action Distribution
- URL: http://arxiv.org/abs/2306.09805v2
- Date: Fri, 9 Feb 2024 16:04:42 GMT
- Title: Mimicking Better by Matching the Approximate Action Distribution
- Authors: João A. Cândido Ramos, Lionel Blondé, Naoya Takeishi and Alexandros Kalousis
- Abstract summary: We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
- Score: 48.81067017094468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce MAAD, a novel, sample-efficient on-policy
algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate
reward signal, which can be derived from various sources such as adversarial
games, trajectory matching objectives, or optimal transport criteria. To
compensate for the unavailability of expert actions, we rely on an inverse
dynamics model that infers a plausible action distribution given the expert's
state-to-state transitions; we regularize the imitator's policy by aligning it to
the inferred action distribution. MAAD leads to significantly improved sample
efficiency and stability. We demonstrate its effectiveness in a number of
MuJoCo environments, in both the OpenAI Gym and the DeepMind Control Suite. We
show that it requires considerably fewer interactions to achieve expert
performance, outperforming current state-of-the-art on-policy methods.
Remarkably, MAAD often stands out as the sole method capable of attaining
expert performance levels, underscoring its simplicity and efficacy.
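As a rough illustration of the regularization described in the abstract, the sketch below aligns a Gaussian imitator policy with the action distribution inferred by an inverse dynamics model via a KL penalty. This is a minimal sketch under an assumed diagonal-Gaussian parameterization; the function names (`gaussian_kl`, `maad_style_loss`) and the weighting coefficient `beta` are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) between diagonal Gaussians."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return float(np.sum(
        np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    ))

def maad_style_loss(policy_dist, idm_dist, surrogate_reward_loss, beta=0.1):
    """Surrogate-reward objective plus a KL penalty that aligns the
    imitator's action distribution with the distribution inferred by an
    inverse dynamics model from expert state-to-state transitions."""
    mu_pi, std_pi = policy_dist     # imitator policy, pi(a | s)
    mu_idm, std_idm = idm_dist      # inverse dynamics model, p(a | s, s')
    kl = gaussian_kl(mu_pi, std_pi, mu_idm, std_idm)
    return surrogate_reward_loss + beta * kl
```

When the two distributions coincide, the penalty vanishes and only the surrogate-reward term remains; as the imitator drifts from the inferred expert actions, the penalty grows.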
Related papers
- POWQMIX: Weighted Value Factorization with Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning [17.644279061872442]
Value function factorization methods are commonly used in cooperative multi-agent reinforcement learning.
We propose a potentially optimal joint actions weighted QMIX algorithm.
Experiments in matrix games, predator-prey, and StarCraft II Multi-Agent Challenge environments demonstrate that our algorithm outperforms the state-of-the-art value-based multi-agent reinforcement learning methods.
arXiv Detail & Related papers (2024-05-13T03:27:35Z) - On the Connection between Invariant Learning and Adversarial Training
for Out-of-Distribution Generalization [14.233038052654484]
Deep learning models rely on spurious features, which cause them to fail catastrophically when generalizing to out-of-distribution (OOD) data.
Recent work shows that Invariant Risk Minimization (IRM) is only effective for a certain type of distribution shift while it fails for other cases.
We propose Domainwise Adversarial Training (DAT), an AT-inspired method that alleviates distribution shift via domain-specific perturbations.
arXiv Detail & Related papers (2022-12-18T13:13:44Z) - Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from Observation describes policy learning in a manner similar to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z) - Deterministic and Discriminative Imitation (D2-Imitation): Revisiting
Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z) - A Deep Reinforcement Learning Approach to Marginalized Importance
Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
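In the tabular case, the observation that the density ratio falls out of the successor representation can be made concrete: the SR of a policy is (I - gamma * P_pi)^{-1}, the policy's discounted state occupancy is the start distribution pushed through that matrix, and the MIS ratio is that occupancy divided by the sampling distribution. A minimal numpy sketch, with illustrative function names:

```python
import numpy as np

def successor_representation(P_pi, gamma):
    """Tabular successor representation: Psi = (I - gamma * P_pi)^{-1}."""
    n = P_pi.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P_pi)

def density_ratio_from_sr(P_pi, d0, d_sample, gamma):
    """MIS density ratio: discounted state occupancy of the target policy,
    read off the SR, divided by the sampling distribution."""
    psi = successor_representation(P_pi, gamma)
    d_pi = (1.0 - gamma) * d0 @ psi   # occupancy vector, sums to 1
    return d_pi / d_sample
```

In deep RL the SR is learned rather than inverted exactly, but the tabular identity is what the learned estimator approximates.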
arXiv Detail & Related papers (2021-06-12T20:21:38Z) - Softmax with Regularization: Better Value Estimation in Multi-Agent
Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
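The paper's exact update scheme is not reproduced in this summary, but a single-agent sketch conveys the idea behind softmax-based regularization: back up values with a softmax (Boltzmann) operator instead of the max, which mitigates overestimation, and penalize action-values that deviate far from that baseline. The function names and the squared-penalty form below are assumptions for illustration only.

```python
import numpy as np

def softmax_value(q_values, tau=1.0):
    """Softmax (Boltzmann) backup: a softened alternative to max_a Q(s,a)."""
    w = np.exp((q_values - q_values.max()) / tau)  # shift for numerical stability
    w /= w.sum()
    return float(w @ q_values)

def regularized_td_loss(q_s, a, reward, next_q, gamma=0.99, lam=0.1, tau=1.0):
    """Squared TD error toward a softmax target, plus a penalty on the
    chosen action-value for deviating from the softmax baseline."""
    target = reward + gamma * softmax_value(next_q, tau)
    td = (q_s[a] - target) ** 2
    penalty = lam * (q_s[a] - softmax_value(q_s, tau)) ** 2
    return td + penalty
```

As the temperature `tau` shrinks, the softmax backup approaches the usual max operator; the penalty term discourages individual action-values from drifting far above the baseline, which is the failure mode behind overestimation.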
arXiv Detail & Related papers (2021-03-22T14:18:39Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample-efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.