Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization
- URL: http://arxiv.org/abs/2006.13258v6
- Date: Fri, 16 Apr 2021 10:09:13 GMT
- Title: Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization
- Authors: Paul Barde, Julien Roy, Wonseok Jeon, Joelle Pineau, Christopher Pal, Derek Nowrouzezahrai
- Abstract summary: Adversarial Imitation Learning alternates between learning a discriminator -- which tells apart expert's demonstrations from generated ones -- and a generator's policy to produce trajectories that can fool this discriminator.
We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation.
- Score: 48.674944885529165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial Imitation Learning alternates between learning a discriminator --
which tells apart expert's demonstrations from generated ones -- and a
generator's policy to produce trajectories that can fool this discriminator.
This alternated optimization is known to be delicate in practice since it
compounds unstable adversarial training with brittle and sample-inefficient
reinforcement learning. We propose to remove the burden of the policy
optimization steps by leveraging a novel discriminator formulation.
Specifically, our discriminator is explicitly conditioned on two policies: the
one from the previous generator's iteration and a learnable policy. When
optimized, this discriminator directly learns the optimal generator's policy.
Consequently, our discriminator's update solves the generator's optimization
problem for free: learning a policy that imitates the expert does not require
an additional optimization loop. This formulation effectively cuts by half the
implementation and computational burden of Adversarial Imitation Learning
algorithms by removing the Reinforcement Learning phase altogether. We show on
a variety of tasks that our simpler approach is competitive to prevalent
Imitation Learning methods.
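To make the structured-discriminator idea concrete, here is a minimal sketch of the per-transition variant (ASAF-1 in the paper), assuming a discrete action space and PyTorch; the network, batch format, and hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of an ASAF-style structured discriminator update (the
# per-transition ASAF-1 variant), assuming discrete actions and PyTorch.
# Names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def log_prob(self, obs, act):
        # log pi(a|s) for a batch of discrete actions
        return torch.distributions.Categorical(logits=self.net(obs)).log_prob(act)

def asaf_discriminator_loss(policy, prev_policy, expert_batch, gen_batch):
    """Binary cross-entropy for the structured discriminator
    D(s, a) = pi(a|s) / (pi(a|s) + pi_G(a|s)), where pi_G is the frozen
    policy from the previous iteration. Because D is built from pi itself,
    fitting the discriminator directly fits the next generator's policy.
    expert_batch and gen_batch are (obs, act) tensor pairs."""
    def d_logit(obs, act):
        with torch.no_grad():
            log_pg = prev_policy.log_prob(obs, act)   # log pi_G(a|s), frozen
        # logit(D) = log D - log(1 - D) = log pi(a|s) - log pi_G(a|s)
        return policy.log_prob(obs, act) - log_pg

    bce = nn.BCEWithLogitsLoss()
    exp_logits = d_logit(*expert_batch)   # expert transitions -> label 1
    gen_logits = d_logit(*gen_batch)      # generated transitions -> label 0
    return bce(exp_logits, torch.ones_like(exp_logits)) + \
           bce(gen_logits, torch.zeros_like(gen_logits))
```

Each outer iteration would then collect trajectories with the current policy, minimize this loss on expert versus generated batches, and promote the trained policy to be the next pi_G; there is no separate reinforcement-learning step.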
Related papers
- Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching [23.600285251963395]
In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment.
Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures.
We propose a novel approach to IRL by direct policy optimization, exploiting a linear factorization of the return as the inner product of successor features and a reward vector.
arXiv Detail & Related papers (2024-11-11T14:05:50Z)
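The linear factorization mentioned in the entry above is the standard successor-feature identity; a hedged sketch in the usual SF notation (not necessarily the paper's exact symbols):

```latex
% Successor-feature factorization of the return (standard SF convention;
% symbols are illustrative, not necessarily the paper's own notation).
% Assume rewards are linear in state-action features: r(s,a) = \phi(s,a)^\top w.
\psi^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t\ge 0} \gamma^{t}\,\phi(s_t,a_t) \,\middle|\, s_0 = s,\, a_0 = a\right],
\qquad
Q^{\pi}(s,a) \;=\; \psi^{\pi}(s,a)^{\top} w .
```

Under this factorization, matching the expert's successor features matches its return for every reward in the linear family, which is what allows direct policy optimization instead of alternating with an adversarial reward search.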
- Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement [1.1510009152620668]
Current methods for constructive neural optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning.
We bridge the two by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning.
We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data.
arXiv Detail & Related papers (2024-03-22T13:09:10Z)
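A hedged sketch of the sample-and-imitate loop described in the entry above; the model and solver interfaces (sample_fn, cost_fn, clone_fn) are hypothetical stand-ins rather than the paper's code, and plain sampling is shown where the paper samples without replacement.

```python
# Sketch of one self-improvement epoch for constructive neural combinatorial
# optimization: sample several solutions per random instance with the current
# model, keep the cheapest one as a pseudo-expert trajectory, and behavior-clone
# onto it. All interfaces below are hypothetical placeholders.
def self_improvement_epoch(model, instances, sample_fn, cost_fn, clone_fn, n_samples=16):
    """sample_fn(model, inst)    -> one candidate solution (e.g. a TSP tour);
    cost_fn(inst, sol)           -> objective value, lower is better;
    clone_fn(model, inst, sol)   -> one supervised imitation (cross-entropy) step."""
    for inst in instances:
        candidates = [sample_fn(model, inst) for _ in range(n_samples)]
        best = min(candidates, key=lambda sol: cost_fn(inst, sol))
        # The best sampled solution plays the role of the expert trajectory.
        clone_fn(model, inst, best)
    return model
```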
- Prompt Optimization via Adversarial In-Context Learning [51.18075178593142]
adv-ICL is implemented as a two-player game between a generator and a discriminator.
The generator tries to generate realistic enough output to fool the discriminator.
We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques.
arXiv Detail & Related papers (2023-12-05T09:44:45Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
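Schematically, the pessimistic objective above replaces the point estimate of a policy's value with a lower confidence bound (notation assumed here, not taken from the paper):

```latex
% Pessimistic policy selection via lower confidence bounds (schematic).
% \hat{V}(\pi) is the estimated value of policy \pi and \hat{e}(\pi) a
% data-driven uncertainty width, e.g. from an empirical Bernstein inequality.
\hat{\pi} \;\in\; \arg\max_{\pi \in \Pi} \Big[\, \hat{V}(\pi) \;-\; \hat{e}(\pi) \,\Big]
```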
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
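A hedged sketch of the state-augmentation idea above, with gym-style names and a vector-valued reward assumed purely for illustration; the paper's exact primal-dual schedule may differ.

```python
# Sketch: the policy observes the current Lagrange multipliers as part of its
# state, and the multipliers follow dual ascent on the constraint shortfalls.
# `env`, `policy`, the vector-reward layout, and the step size are assumptions.
import numpy as np

def constrained_epoch(env, policy, lambdas, thresholds, eta=0.05):
    # Primal phase: act on the multiplier-augmented state for one episode.
    # (The policy-improvement step itself is omitted; any RL update applies.)
    obs, done = env.reset(), False
    constraint_returns = np.zeros_like(lambdas)
    while not done:
        action = policy(np.concatenate([obs, lambdas]))
        obs, reward_vec, done, _ = env.step(action)
        constraint_returns += reward_vec[1:]          # accumulate constraint rewards
    # Dual phase: raise each multiplier whose constraint return falls short.
    lambdas = np.maximum(0.0, lambdas + eta * (thresholds - constraint_returns))
    return lambdas
```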
- Non-Adversarial Imitation Learning and its Connections to Adversarial Methods [21.89749623434729]
We present a framework for non-adversarial imitation learning.
The resulting algorithms are similar to their adversarial counterparts.
We also show that our non-adversarial formulation can be used to derive novel algorithms.
arXiv Detail & Related papers (2020-08-08T13:43:06Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)