Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization
- URL: http://arxiv.org/abs/2006.13258v6
- Date: Fri, 16 Apr 2021 10:09:13 GMT
- Title: Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization
- Authors: Paul Barde, Julien Roy, Wonseok Jeon, Joelle Pineau, Christopher Pal, Derek Nowrouzezahrai
- Abstract summary: Adversarial Imitation Learning alternates between learning a discriminator -- which tells apart expert's demonstrations from generated ones -- and a generator's policy to produce trajectories that can fool this discriminator.
We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation.
- Score: 48.674944885529165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial Imitation Learning alternates between learning a discriminator --
which tells apart expert's demonstrations from generated ones -- and a
generator's policy to produce trajectories that can fool this discriminator.
This alternated optimization is known to be delicate in practice since it
compounds unstable adversarial training with brittle and sample-inefficient
reinforcement learning. We propose to remove the burden of the policy
optimization steps by leveraging a novel discriminator formulation.
Specifically, our discriminator is explicitly conditioned on two policies: the
one from the previous generator's iteration and a learnable policy. When
optimized, this discriminator directly learns the optimal generator's policy.
Consequently, our discriminator's update solves the generator's optimization
problem for free: learning a policy that imitates the expert does not require
an additional optimization loop. This formulation effectively cuts by half the
implementation and computational burden of Adversarial Imitation Learning
algorithms by removing the Reinforcement Learning phase altogether. We show on
a variety of tasks that our simpler approach is competitive to prevalent
Imitation Learning methods.
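To make the structured-discriminator idea concrete, here is a minimal sketch of the per-transition variant (ASAF-1 in the paper), assuming a discrete action space and PyTorch; the network, batch format, and hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of an ASAF-style structured discriminator update (the
# per-transition ASAF-1 variant), assuming discrete actions and PyTorch.
# Names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def log_prob(self, obs, act):
        # log pi(a|s) for a batch of discrete actions
        return torch.distributions.Categorical(logits=self.net(obs)).log_prob(act)

def asaf_discriminator_loss(policy, prev_policy, expert_batch, gen_batch):
    """Binary cross-entropy for the structured discriminator
    D(s, a) = pi(a|s) / (pi(a|s) + pi_G(a|s)), where pi_G is the frozen
    policy from the previous iteration. Because D is built from pi itself,
    fitting the discriminator directly fits the next generator's policy.
    expert_batch and gen_batch are (obs, act) tensor pairs."""
    def d_logit(obs, act):
        with torch.no_grad():
            log_pg = prev_policy.log_prob(obs, act)   # log pi_G(a|s), frozen
        # logit(D) = log D - log(1 - D) = log pi(a|s) - log pi_G(a|s)
        return policy.log_prob(obs, act) - log_pg

    bce = nn.BCEWithLogitsLoss()
    exp_logits = d_logit(*expert_batch)   # expert transitions -> label 1
    gen_logits = d_logit(*gen_batch)      # generated transitions -> label 0
    return bce(exp_logits, torch.ones_like(exp_logits)) + \
           bce(gen_logits, torch.zeros_like(gen_logits))
```

Each outer iteration would then collect trajectories with the current policy, minimize this loss on expert versus generated batches, and promote the trained policy to be the next pi_G; there is no separate reinforcement-learning step.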
Related papers
- Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching [23.600285251963395]
In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment.
Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures.
We propose a novel approach to IRL by direct policy optimization, exploiting a linear factorization of the return as the inner product of successor features and a reward vector.
arXiv Detail & Related papers (2024-11-11T14:05:50Z)
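The linear factorization mentioned in the entry above is the standard successor-feature identity; a hedged sketch in the usual SF notation (not necessarily the paper's exact symbols):

```latex
% Successor-feature factorization of the return (standard SF convention;
% symbols are illustrative, not necessarily the paper's own notation).
% Assume rewards are linear in state-action features: r(s,a) = \phi(s,a)^\top w.
\psi^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t\ge 0} \gamma^{t}\,\phi(s_t,a_t) \,\middle|\, s_0 = s,\, a_0 = a\right],
\qquad
Q^{\pi}(s,a) \;=\; \psi^{\pi}(s,a)^{\top} w .
```

Under this factorization, matching the expert's successor features matches its return for every reward in the linear family, which is what allows direct policy optimization instead of alternating with an adversarial reward search.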
- Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement [1.1510009152620668]
Current methods for constructive neural optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning.
We bridge the two by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning.
We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data.
arXiv Detail & Related papers (2024-03-22T13:09:10Z)
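A hedged sketch of the sample-and-imitate loop described in the entry above; the model and solver interfaces (sample_fn, cost_fn, clone_fn) are hypothetical stand-ins rather than the paper's code, and plain sampling is shown where the paper samples without replacement.

```python
# Sketch of one self-improvement epoch for constructive neural combinatorial
# optimization: sample several solutions per random instance with the current
# model, keep the cheapest one as a pseudo-expert trajectory, and behavior-clone
# onto it. All interfaces below are hypothetical placeholders.
def self_improvement_epoch(model, instances, sample_fn, cost_fn, clone_fn, n_samples=16):
    """sample_fn(model, inst)    -> one candidate solution (e.g. a TSP tour);
    cost_fn(inst, sol)           -> objective value, lower is better;
    clone_fn(model, inst, sol)   -> one supervised imitation (cross-entropy) step."""
    for inst in instances:
        candidates = [sample_fn(model, inst) for _ in range(n_samples)]
        best = min(candidates, key=lambda sol: cost_fn(inst, sol))
        # The best sampled solution plays the role of the expert trajectory.
        clone_fn(model, inst, best)
    return model
```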
- Prompt Optimization via Adversarial In-Context Learning [51.18075178593142]
adv-ICL is implemented as a two-player game between a generator and a discriminator.
The generator tries to generate realistic enough output to fool the discriminator.
We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques.
arXiv Detail & Related papers (2023-12-05T09:44:45Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
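Schematically, the pessimistic objective above replaces the point estimate of a policy's value with a lower confidence bound (notation assumed here, not taken from the paper):

```latex
% Pessimistic policy selection via lower confidence bounds (schematic).
% \hat{V}(\pi) is the estimated value of policy \pi and \hat{e}(\pi) a
% data-driven uncertainty width, e.g. from an empirical Bernstein inequality.
\hat{\pi} \;\in\; \arg\max_{\pi \in \Pi} \Big[\, \hat{V}(\pi) \;-\; \hat{e}(\pi) \,\Big]
```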
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
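A hedged sketch of the state-augmentation idea above, with gym-style names and a vector-valued reward assumed purely for illustration; the paper's exact primal-dual schedule may differ.

```python
# Sketch: the policy observes the current Lagrange multipliers as part of its
# state, and the multipliers follow dual ascent on the constraint shortfalls.
# `env`, `policy`, the vector-reward layout, and the step size are assumptions.
import numpy as np

def constrained_epoch(env, policy, lambdas, thresholds, eta=0.05):
    # Primal phase: act on the multiplier-augmented state for one episode.
    # (The policy-improvement step itself is omitted; any RL update applies.)
    obs, done = env.reset(), False
    constraint_returns = np.zeros_like(lambdas)
    while not done:
        action = policy(np.concatenate([obs, lambdas]))
        obs, reward_vec, done, _ = env.step(action)
        constraint_returns += reward_vec[1:]          # accumulate constraint rewards
    # Dual phase: raise each multiplier whose constraint return falls short.
    lambdas = np.maximum(0.0, lambdas + eta * (thresholds - constraint_returns))
    return lambdas
```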
- Non-Adversarial Imitation Learning and its Connections to Adversarial Methods [21.89749623434729]
We present a framework for non-adversarial imitation learning.
The resulting algorithms are similar to their adversarial counterparts.
We also show that our non-adversarial formulation can be used to derive novel algorithms.
arXiv Detail & Related papers (2020-08-08T13:43:06Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)