State-only Imitation with Transition Dynamics Mismatch
- URL: http://arxiv.org/abs/2002.11879v1
- Date: Thu, 27 Feb 2020 02:27:46 GMT
- Title: State-only Imitation with Transition Dynamics Mismatch
- Authors: Tanmay Gangwani, Jian Peng
- Abstract summary: Imitation Learning (IL) is a popular paradigm for training agents to achieve complicated goals by leveraging expert behavior.
We present a new state-only IL algorithm in this paper.
We show that our algorithm is particularly effective when there is a transition dynamics mismatch between the expert and imitator MDPs.
- Score: 16.934888672659824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation Learning (IL) is a popular paradigm for training agents to achieve
complicated goals by leveraging expert behavior, rather than dealing with the
hardships of designing a correct reward function. With the environment modeled
as a Markov Decision Process (MDP), most of the existing IL algorithms are
contingent on the availability of expert demonstrations in the same MDP as the
one in which a new imitator policy is to be learned. This is uncharacteristic
of many real-life scenarios where discrepancies between the expert and the
imitator MDPs are common, especially in the transition dynamics function.
Furthermore, obtaining expert actions may be costly or infeasible, making the
recent trend towards state-only IL (where expert demonstrations constitute only
states or observations) ever so promising. Building on recent adversarial
imitation approaches that are motivated by the idea of divergence minimization,
we present a new state-only IL algorithm in this paper. It divides the overall
optimization objective into two subproblems by introducing an indirection step
and solves the subproblems iteratively. We show that our algorithm is
particularly effective when there is a transition dynamics mismatch between the
expert and imitator MDPs, while the baseline IL methods suffer from performance
degradation. To analyze this, we construct several interesting MDPs by
modifying the configuration parameters for the MuJoCo locomotion tasks from
OpenAI Gym.
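The "indirection step" in the abstract admits a natural reading in the divergence-minimization view. The formalization below is an editorial sketch based only on the abstract: $\rho_\pi$ and $\rho_E$ denote the imitator's and expert's state-visitation distributions, and $\rho_B$ is a hypothetical intermediate distribution; none of this notation is taken from the paper.

```latex
% State-only divergence minimization: the imitator's state-visitation
% distribution \rho_\pi should match the expert's \rho_E. Introducing a
% hypothetical intermediate distribution \rho_B splits this into two
% coupled subproblems that can be solved iteratively.
\min_{\pi} D\!\left(\rho_\pi \,\|\, \rho_E\right)
\;\longrightarrow\;
\underbrace{\min_{\pi} D\!\left(\rho_\pi \,\|\, \rho_B\right)}_{\text{imitate the intermediate}}
\quad\text{and}\quad
\underbrace{\min_{\rho_B} D'\!\left(\rho_B \,\|\, \rho_E\right)}_{\text{pull the intermediate toward the expert}}
```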
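The abstract describes the testbeds only as MuJoCo locomotion tasks with modified configuration parameters. Below is a minimal sketch of constructing such a dynamics mismatch, assuming the Gymnasium MuJoCo bindings; gravity and body-mass scaling are illustrative choices, not necessarily the parameters varied in the paper.

```python
import gymnasium as gym

def make_mismatched_env(env_id="HalfCheetah-v4", gravity_scale=1.0, mass_scale=1.0):
    """Create a MuJoCo locomotion env whose transition dynamics differ
    from the default by scaling gravity and body masses."""
    env = gym.make(env_id)
    model = env.unwrapped.model
    model.opt.gravity[:] *= gravity_scale   # default gravity is (0, 0, -9.81)
    model.body_mass[:] *= mass_scale        # heavier links change the dynamics
    return env

# Expert MDP: default physics. Imitator MDP: e.g., stronger gravity, heavier body.
expert_env = make_mismatched_env()
imitator_env = make_mismatched_env(gravity_scale=1.5, mass_scale=1.25)
```

Because only the transition dynamics change, state sequences demonstrated in expert_env remain meaningful targets, but they are no longer reproducible action-for-action in imitator_env, which is exactly the mismatch the paper studies.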
Related papers
- FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL [19.236153474365747]
Existing MARL approaches often rely on the restrictive assumption that the number of entities remains constant between training and inference.
In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization.
We propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods.
arXiv Detail & Related papers (2024-10-21T10:57:45Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
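The summary above does not spell out the modulation rule. A generic sketch of gradient modulation for a two-branch (e.g., audio-visual) model follows; the coefficient form and the score_a/score_v confidence inputs are assumptions, not the paper's exact formulation.

```python
import math
import torch

def modulate_gradients(audio_params, visual_params, score_a, score_v, alpha=0.5):
    """Generic on-the-fly gradient modulation: after the joint loss has been
    backpropagated, down-weight the gradients of whichever modality currently
    dominates, as measured by per-batch confidence scores."""
    ratio = score_a / (score_v + 1e-8)            # > 1 means audio dominates
    coeff_a = 1.0 - math.tanh(alpha * max(0.0, ratio - 1.0))
    coeff_v = 1.0 - math.tanh(alpha * max(0.0, 1.0 / ratio - 1.0))
    with torch.no_grad():
        for p in audio_params:
            if p.grad is not None:
                p.grad.mul_(coeff_a)             # slow down the strong branch
        for p in visual_params:
            if p.grad is not None:
                p.grad.mul_(coeff_v)             # let the weak branch catch up
```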
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
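Taking the title's "Orthogonal Optimizer" at face value, the update described above can be sketched as a gradient projection; the other_experts basis and the regularized solve are editorial choices, not the paper's exact procedure.

```python
import torch

def orthogonal_update_direction(grad, other_experts):
    """Project an expert's flattened gradient `grad` (d,) onto the orthogonal
    complement of the subspace spanned by the rows of `other_experts` (k, d),
    so the expert moves away from what its peers already represent."""
    O = other_experts                              # (k, d) basis of the span
    G = O @ O.T                                    # (k, k) Gram matrix
    rhs = O @ grad                                 # (k,) coordinates of grad
    coef = torch.linalg.solve(G + 1e-6 * torch.eye(G.shape[0]), rhs)
    return grad - O.T @ coef                       # component outside the span
```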
- Imitator Learning: Achieve Out-of-the-Box Imitation Ability in Variable Environments [45.213059639254475]
We propose a new topic called imitator learning (ItorL).
It aims to derive an imitator module that can reconstruct imitation policies from very limited expert demonstrations.
For autonomous imitation-policy building, we design a demonstration-based attention architecture for the imitator policy.
arXiv Detail & Related papers (2023-10-09T13:35:28Z)
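The summary names a demonstration-based attention architecture without detail. One hypothetical realization is cross-attention from the current observation to demonstration state tokens; all layer choices below are assumptions.

```python
import torch
import torch.nn as nn

class DemoAttentionPolicy(nn.Module):
    """Hypothetical demonstration-conditioned policy: the current observation
    attends over the states of a few expert demonstrations, and the attended
    summary drives the action head."""
    def __init__(self, obs_dim, act_dim, embed_dim=128):
        super().__init__()
        self.obs_enc = nn.Linear(obs_dim, embed_dim)
        self.demo_enc = nn.Linear(obs_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
        self.head = nn.Linear(2 * embed_dim, act_dim)

    def forward(self, obs, demo_states):
        # obs: (B, obs_dim); demo_states: (B, T, obs_dim) demo state tokens
        q = self.obs_enc(obs).unsqueeze(1)         # (B, 1, E) query
        kv = self.demo_enc(demo_states)            # (B, T, E) keys/values
        ctx, _ = self.attn(q, kv, kv)              # (B, 1, E) attended context
        h = torch.cat([q.squeeze(1), ctx.squeeze(1)], dim=-1)
        return self.head(h)                        # deterministic action sketch
```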
- Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
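The summary does not state how the approximate action distribution is obtained. A standard route in imitation from observations is an inverse-dynamics model fit on the imitator's own transitions, sketched below as a generic construction rather than MAAD's confirmed mechanism.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Generic inverse-dynamics model p(a | s, s'): given consecutive states,
    output a Gaussian over the action that caused the transition."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim),        # mean and log-std
        )

    def forward(self, s, s_next):
        mu, log_std = self.net(torch.cat([s, s_next], -1)).chunk(2, -1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

def action_matching_loss(policy_dist, idm, s_exp, s_exp_next):
    # Fit the policy's action distribution to the actions the inverse model
    # attributes to the expert's observed state transitions.
    with torch.no_grad():
        target = idm(s_exp, s_exp_next)
    return torch.distributions.kl_divergence(policy_dist, target).sum(-1).mean()
```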
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
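The summary leaves the topological-constraint mechanism unspecified. One simple way an ordering over objectives can be enforced on top of a PPO-style return signal is the gating scheme sketched below; this is an editorial illustration, not the paper's TMDP formulation.

```python
def lexicographic_return(objective_values, thresholds):
    """Generic ordered scalarization: objective i+1 contributes to the
    scalar signal only once objective i clears its threshold, so earlier
    objectives in the ordering always take precedence."""
    total, gate = 0.0, 1.0
    for value, tau in zip(objective_values, thresholds):
        total += gate * value
        gate *= float(value >= tau)   # unlock the next objective in the order
    return total
```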
- Cross-domain Imitation from Observations [50.669343548588294]
Imitation learning seeks to circumvent the difficulty in designing proper reward functions for training agents by utilizing expert behavior.
In this paper, we study the problem of how to imitate tasks when there are discrepancies between the expert and agent MDPs.
We present a novel framework to learn correspondences across such domains.
arXiv Detail & Related papers (2021-05-20T21:08:25Z)
- Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection [7.685002911021767]
We introduce an algorithm that efficiently learns policies in non-stationary environments.
It analyzes a possibly infinite stream of data and computes, in real time, high-confidence change-point detection statistics.
We show that this algorithm minimizes the delay until unforeseen changes to a context are detected, thereby allowing for rapid responses.
arXiv Detail & Related papers (2021-05-20T01:57:52Z)
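As background for the real-time, high-confidence statistics mentioned above, the classical one-sided CUSUM detector is sketched below; the paper's actual statistic is not specified in this summary.

```python
class Cusum:
    """Classical one-sided CUSUM detector: flags a change when the cumulative
    drift of observations above a reference level exceeds threshold h."""
    def __init__(self, reference: float, drift: float, threshold: float):
        self.mu, self.k, self.h, self.s = reference, drift, threshold, 0.0

    def update(self, x: float) -> bool:
        # Accumulate evidence that the stream has drifted above the reference.
        self.s = max(0.0, self.s + (x - self.mu - self.k))
        if self.s > self.h:        # high-confidence evidence of a change
            self.s = 0.0
            return True
        return False
```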
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation-learning-from-observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
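For context, the $f$-divergence minimization framework referenced here is usually instantiated through the variational (f-GAN) lower bound, where a discriminator $T_\omega$ is trained to tighten the bound; this is standard background rather than the paper's specific reparameterization.

```latex
% Variational lower bound used in f-divergence-based adversarial imitation:
% T_\omega is the discriminator and f^* the convex conjugate of the
% generator function f defining the divergence.
D_f\big(\rho_E \,\|\, \rho_\pi\big) \;\ge\; \sup_{T_\omega}\;
\mathbb{E}_{s \sim \rho_E}\!\left[T_\omega(s)\right]
\;-\; \mathbb{E}_{s \sim \rho_\pi}\!\left[f^{*}\!\big(T_\omega(s)\big)\right]
```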
This list is automatically generated from the titles and abstracts of the papers on this site.