Deterministic and Discriminative Imitation (D2-Imitation): Revisiting
Adversarial Imitation for Sample Efficiency
- URL: http://arxiv.org/abs/2112.06054v1
- Date: Sat, 11 Dec 2021 19:36:19 GMT
- Title: Deterministic and Discriminative Imitation (D2-Imitation): Revisiting
Adversarial Imitation for Sample Efficiency
- Authors: Mingfei Sun, Sam Devlin, Katja Hofmann and Shimon Whiteson
- Abstract summary: We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation achieves good sample efficiency, outperforming several off-policy extensions of adversarial imitation.
- Score: 61.03922379081648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sample efficiency is crucial for imitation learning methods to be applicable
in real-world applications. Many studies improve sample efficiency by extending
adversarial imitation to be off-policy, even though these off-policy extensions
can either change the original objective or involve complicated optimization.
We revisit the foundation of adversarial imitation
and propose an off-policy sample efficient approach that requires no
adversarial training or min-max optimization. Our formulation capitalizes on
two key insights: (1) the similarity between the Bellman equation and the
stationary state-action distribution equation allows us to derive a novel
temporal difference (TD) learning approach; and (2) the use of a deterministic
policy simplifies the TD learning. Combined, these insights yield a practical
algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which
operates by first partitioning samples into two replay buffers and then
learning a deterministic policy via off-policy reinforcement learning. Our
empirical results show that D2-Imitation achieves good sample efficiency,
outperforming several off-policy extensions of adversarial imitation on many
control tasks.
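The abstract names insight (1) without spelling it out; as a rough sketch (standard textbook forms, assumed rather than quoted from the paper), the two equations it refers to are

```latex
% Bellman equation for the action-value function of a policy \pi
Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a),\; a' \sim \pi(\cdot\mid s')}\big[ Q^{\pi}(s',a') \big]

% Stationary (discounted) state-action distribution equation for the same policy
d^{\pi}(s,a) = (1-\gamma)\, d_0(s)\,\pi(a\mid s) + \gamma \sum_{\bar{s},\bar{a}} P(s \mid \bar{s},\bar{a})\, \pi(a\mid s)\, d^{\pi}(\bar{s},\bar{a})
```

Both are fixed-point equations in an unknown function over state-action pairs, which is what makes a sampled TD-style update applicable to the distribution as well as to the value function; with a deterministic policy, the expectation over a' collapses to the single action mu(s'), which is insight (2). Along the same lines, the algorithm description in the abstract can be read as a simple training loop: partition incoming transitions into two replay buffers, then run deterministic off-policy RL over them. A minimal sketch follows; the discriminator-based partitioning rule, the 1/0 reward assignment, and the DDPG-style updates are illustrative assumptions, not the paper's exact recipe.

```python
import random
import torch
import torch.nn as nn

def mlp(inp, out):
    # small two-layer network used for every function approximator in this sketch
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out))

class D2ImitationSketch:
    def __init__(self, obs_dim, act_dim, gamma=0.99):
        self.actor = mlp(obs_dim, act_dim)           # deterministic policy mu(s)
        self.critic = mlp(obs_dim + act_dim, 1)      # Q(s, a)
        self.disc = mlp(obs_dim + act_dim, 1)        # expert-vs-other discriminator (assumed)
        self.buf_pos, self.buf_neg = [], []          # the two replay buffers from the abstract
        self.gamma = gamma
        self.opt_actor = torch.optim.Adam(self.actor.parameters(), lr=3e-4)
        self.opt_critic = torch.optim.Adam(self.critic.parameters(), lr=3e-4)

    def partition(self, s, a, s_next):
        # Route each transition to the "expert-like" or "other" buffer
        # based on the discriminator's score (assumed partitioning rule).
        with torch.no_grad():
            expert_like = torch.sigmoid(self.disc(torch.cat([s, a]))).item() > 0.5
        (self.buf_pos if expert_like else self.buf_neg).append((s, a, s_next))

    def td_update(self, batch_size=64):
        # One off-policy TD step per buffer; reward 1 for expert-like
        # transitions and 0 otherwise (assumed reward assignment).
        for buf, r in ((self.buf_pos, 1.0), (self.buf_neg, 0.0)):
            if len(buf) < batch_size:
                continue
            s, a, s_next = (torch.stack(x) for x in zip(*random.sample(buf, batch_size)))
            with torch.no_grad():
                q_next = self.critic(torch.cat([s_next, self.actor(s_next)], dim=-1))
                target = r + self.gamma * q_next
            critic_loss = ((self.critic(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
            self.opt_critic.zero_grad(); critic_loss.backward(); self.opt_critic.step()
            # deterministic policy gradient: push mu(s) toward actions the critic rates highly
            actor_loss = -self.critic(torch.cat([s, self.actor(s)], dim=-1)).mean()
            self.opt_actor.zero_grad(); actor_loss.backward(); self.opt_actor.step()
```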
Related papers
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Distillation Policy Optimization [5.439020425819001]
We introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control.
This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline.
Our results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to the off-policy approaches.
arXiv Detail & Related papers (2023-02-01T15:59:57Z)
- Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.