Provably Efficient Generative Adversarial Imitation Learning for Online
and Offline Setting with Linear Function Approximation
- URL: http://arxiv.org/abs/2108.08765v1
- Date: Thu, 19 Aug 2021 16:16:00 GMT
- Title: Provably Efficient Generative Adversarial Imitation Learning for Online
and Offline Setting with Linear Function Approximation
- Authors: Zhihan Liu, Yufeng Zhang, Zuyue Fu, Zhuoran Yang, and Zhaoran Wang
- Abstract summary: In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain reward set.
We study GAIL in both online and offline settings with linear function approximation, where both the transition and reward function are linear in the feature maps.
- Score: 81.0955457177017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In generative adversarial imitation learning (GAIL), the agent aims to learn
a policy from an expert demonstration so that its performance cannot be
discriminated from the expert policy on a certain predefined reward set. In
this paper, we study GAIL in both online and offline settings with linear
function approximation, where both the transition and reward function are
linear in the feature maps. Besides the expert demonstration, in the online
setting the agent can interact with the environment, while in the offline
setting the agent only has access to an additional dataset collected beforehand. For
online GAIL, we propose an optimistic generative adversarial policy
optimization algorithm (OGAP) and prove that OGAP achieves
$\widetilde{\mathcal{O}}(H^2 d^{3/2}K^{1/2}+KH^{3/2}dN_1^{-1/2})$ regret. Here
$N_1$ represents the number of trajectories of the expert demonstration, $d$ is
the feature dimension, $K$ is the number of episodes, and $H$ is the length of each episode.
For offline GAIL, we propose a pessimistic generative adversarial policy
optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain
the optimality gap of PGAP, achieving the minimax lower bound in the
utilization of the additional dataset. Assuming sufficient coverage on the
additional dataset, we show that PGAP achieves
$\widetilde{\mathcal{O}}(H^{2}dK^{-1/2}+H^2d^{3/2}N_2^{-1/2}+H^{3/2}dN_1^{-1/2})$ optimality gap. Here $N_2$
represents the number of trajectories of the additional dataset with sufficient
coverage.
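
To ground the linear function approximation setting described above: optimistic online algorithms and pessimistic offline algorithms for linear MDPs typically build on regularized least-squares estimates over a $d$-dimensional feature map, adjusted by an elliptical confidence bonus that is added for optimism or subtracted for pessimism. The Python sketch below is not the paper's OGAP or PGAP pseudocode; it only illustrates this standard bonus construction on assumed toy data, and the helper names (fit_linear_model, optimistic_value, pessimistic_value) and bonus scale beta are introduced purely for illustration.

```python
import numpy as np

def fit_linear_model(Phi, y, lam=1.0):
    """Ridge regression of targets y on features Phi (n x d).

    Returns the weight estimate w_hat and the regularized Gram matrix
    Lambda = lam * I + Phi^T Phi that defines the confidence ellipsoid.
    """
    d = Phi.shape[1]
    Lambda = lam * np.eye(d) + Phi.T @ Phi
    w_hat = np.linalg.solve(Lambda, Phi.T @ y)
    return w_hat, Lambda

def bonus(phi, Lambda, beta):
    """Elliptical confidence width beta * ||phi||_{Lambda^{-1}}."""
    return beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))

def optimistic_value(phi, w_hat, Lambda, beta):
    """Online-style estimate: point prediction plus the bonus (optimism)."""
    return phi @ w_hat + bonus(phi, Lambda, beta)

def pessimistic_value(phi, w_hat, Lambda, beta):
    """Offline-style estimate: point prediction minus the bonus (pessimism)."""
    return phi @ w_hat - bonus(phi, Lambda, beta)

# Toy usage with random features standing in for phi(s, a); illustrative only.
rng = np.random.default_rng(0)
n, d = 200, 8
Phi = rng.normal(size=(n, d))                             # observed feature vectors
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # noisy linear targets
w_hat, Lambda = fit_linear_model(Phi, y)
phi_query = rng.normal(size=d)                            # feature vector at a query (s, a)
print(optimistic_value(phi_query, w_hat, Lambda, beta=1.0))
print(pessimistic_value(phi_query, w_hat, Lambda, beta=1.0))
```

In the online case the added bonus encourages exploring feature directions that past data covers poorly, while in the offline case subtracting it guards against overestimating values in directions the fixed dataset does not cover.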
Related papers
- A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs [18.449996575976993]
We propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting.
Our algorithm is the first computationally efficient algorithm in this setting that achieves a sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption.
We extend our algorithm to work in the offline constrained RL setting that enforces constraints on additional reward signals.
arXiv Detail & Related papers (2024-02-07T00:33:11Z)
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(1/K)$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(1/K)$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (an illustrative sketch of the redistribution step appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Minimax Optimal Online Imitation Learning via Replay Estimation [47.83919594113314]
We introduce a technique of replay estimation to reduce the empirical variance.
We show that our approach achieves the optimal $\widetilde{O}\left(\min(H^{3/2}/N,\, H/\sqrt{N})\right)$ dependency.
arXiv Detail & Related papers (2022-05-30T19:29:56Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H+d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
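
As referenced in the PARTED entry above, the redistribution of a trajectory-wise return into per-step proxy rewards can be sketched as a single ridge regression of trajectory returns onto summed per-step features. The code below is an illustrative, assumption-laden sketch (random toy features, a hypothetical redistribute_rewards helper), not the PARTED authors' implementation.

```python
import numpy as np

def redistribute_rewards(trajectories, returns, lam=1.0):
    """Least-squares reward redistribution (sketch).

    trajectories: list of (H x d) arrays of per-step features phi(s_h, a_h)
    returns:      length-n array of trajectory-wise returns
    Fits theta so that sum_h phi_h^T theta approximates each trajectory's
    return; r_hat(s, a) = phi(s, a)^T theta then serves as a per-step proxy reward.
    """
    X = np.stack([traj.sum(axis=0) for traj in trajectories])  # summed features, n x d
    d = X.shape[1]
    Lambda = lam * np.eye(d) + X.T @ X
    theta = np.linalg.solve(Lambda, X.T @ np.asarray(returns))
    return theta

# Toy usage with random features standing in for phi(s, a); illustrative only.
rng = np.random.default_rng(1)
H, d, n = 10, 5, 50
true_theta = rng.normal(size=d)
trajs = [rng.normal(size=(H, d)) for _ in range(n)]
rets = np.array([(traj @ true_theta).sum() for traj in trajs])  # trajectory-wise returns
theta_hat = redistribute_rewards(trajs, rets)
print((trajs[0] @ theta_hat).round(2))  # per-step proxy rewards for one trajectory
```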