Nearly Minimax Optimal Adversarial Imitation Learning with Known and
Unknown Transitions
- URL: http://arxiv.org/abs/2106.10424v1
- Date: Sat, 19 Jun 2021 04:41:33 GMT
- Title: Nearly Minimax Optimal Adversarial Imitation Learning with Known and
Unknown Transitions
- Authors: Tian Xu, Ziniu Li, Yang Yu
- Abstract summary: This paper is dedicated to designing provably efficient adversarial imitation learning (AIL) algorithms that directly optimize policies from expert demonstrations.
We develop a transition-aware AIL algorithm named TAIL with an expert sample complexity of $\tilde{O}(H^{3/2} |S|/\varepsilon)$ under the known transition setting.
In particular, MB-TAIL builds an empirical transition model by interacting with the environment and performs imitation under the recovered empirical model.
- Score: 13.9603281084922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is dedicated to designing provably efficient adversarial imitation
learning (AIL) algorithms that directly optimize policies from expert
demonstrations. Firstly, we develop a transition-aware AIL algorithm named TAIL
with an expert sample complexity of $\tilde{O}(H^{3/2} |S|/\varepsilon)$ under
the known transition setting, where $H$ is the planning horizon, $|S|$ is the
state space size and $\varepsilon$ is the desired policy value gap. This improves
upon the previous best bound of $\tilde{O}(H^2 |S| / \varepsilon^2)$ for AIL
methods and matches the lower bound of $\tilde{\Omega} (H^{3/2}
|S|/\varepsilon)$ in [Rajaraman et al., 2021] up to a logarithmic factor. The
key ingredient of TAIL is a fine-grained estimator for expert state-action
distribution, which explicitly utilizes the transition function information.
Secondly, considering practical settings where the transition functions are
usually unknown but environment interaction is allowed, we accordingly develop
a model-based transition-aware AIL algorithm named MB-TAIL. In particular,
MB-TAIL builds an empirical transition model by interacting with the
environment and performs imitation under the recovered empirical model. The
interaction complexity of MB-TAIL is $\tilde{O} (H^3 |S|^2 |A| /
\varepsilon^2)$, which improves the best known result of $\tilde{O} (H^4 |S|^2
|A| / \varepsilon^2)$ in [Shani et al., 2021]. Finally, our theoretical results
are supported by numerical evaluation and detailed analysis on two challenging
MDPs.
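The model-based step the abstract attributes to MB-TAIL (build an empirical transition model from environment interaction, then imitate under it) can be illustrated with a minimal tabular sketch. This is a generic illustration under assumed names, not the paper's algorithm: `sample_next_state` is a hypothetical environment hook, and uniform per-pair sampling is a simplification of whatever interaction scheme MB-TAIL actually uses.

```python
import numpy as np

def estimate_transitions(sample_next_state, n_states, n_actions, n_samples_per_pair):
    """Estimate an empirical transition kernel P_hat[s, a, s'] from
    environment interaction (generic tabular sketch; `sample_next_state(s, a)`
    is a hypothetical hook returning one next state from the true,
    unknown transition kernel)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples_per_pair):
                counts[s, a, sample_next_state(s, a)] += 1
    # Normalize visit counts into a row-stochastic empirical kernel.
    return counts / counts.sum(axis=2, keepdims=True)
```

With enough samples per pair, each row of the recovered kernel concentrates around the true next-state distribution, which is what makes imitation under the empirical model meaningful.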
Related papers
- Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation [8.274693573069442]
We study reinforcement learning with multinomial logistic (MNL) function approximation.
We propose provably efficient algorithms with randomized exploration having frequentist regret guarantees.
Numerical experiments demonstrate the superior performance of the proposed algorithms.
arXiv Detail & Related papers (2024-05-30T15:39:19Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Provably Efficient Adversarial Imitation Learning with Unknown Transitions [24.70187647541753]
Imitation learning (IL) has proven to be an effective method for learning good policies from expert demonstrations.
This paper explores the theoretical underpinnings of AIL in the presence of unknown transitions.
We propose an algorithm, MB-TAIL, that achieves the minimax optimal expert sample complexity of $\widetilde{O}(H^{3/2} |S|/\varepsilon)$ and interaction complexity of $\widetilde{O}(H^3 |S|^2 |A|/\varepsilon^2)$.
arXiv Detail & Related papers (2023-06-11T02:46:41Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
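The planning phase described above (a reward function is handed to the agent, which then plans under a model learned during exploration) amounts to finite-horizon dynamic programming. A minimal sketch follows; the function name and the assumption that the model is a tabular kernel `P_hat` are illustrative, not the UCRL-RFE algorithm itself.

```python
import numpy as np

def plan_with_model(P_hat, R, horizon):
    """Finite-horizon value iteration under an estimated model.

    P_hat: (S, A, S) estimated transition kernel
    R:     (S, A) reward given at planning time
    Returns a step-indexed greedy policy and the step-0 values."""
    n_states, n_actions, _ = P_hat.shape
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P_hat @ V          # Bellman backup: (S, A) action values at step h
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

Because the reward enters only here, the same exploration data can be reused to plan for any reward function, which is the point of the reward-free setting.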
arXiv Detail & Related papers (2021-10-12T23:03:58Z)
- Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.