Nearly Minimax Optimal Adversarial Imitation Learning with Known and
Unknown Transitions
- URL: http://arxiv.org/abs/2106.10424v1
- Date: Sat, 19 Jun 2021 04:41:33 GMT
- Title: Nearly Minimax Optimal Adversarial Imitation Learning with Known and
Unknown Transitions
- Authors: Tian Xu, Ziniu Li, Yang Yu
- Abstract summary: This paper is dedicated to designing provably efficient adversarial imitation learning (AIL) algorithms that directly optimize policies from expert demonstrations.
We develop a transition-aware AIL algorithm named TAIL with an expert sample complexity of $\tilde{O}(H^{3/2} |S|/\varepsilon)$ under the known transition setting.
In particular, MB-TAIL builds an empirical transition model by interacting with the environment and performs imitation under the recovered empirical model.
- Score: 13.9603281084922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is dedicated to designing provably efficient adversarial imitation
learning (AIL) algorithms that directly optimize policies from expert
demonstrations. Firstly, we develop a transition-aware AIL algorithm named TAIL
with an expert sample complexity of $\tilde{O}(H^{3/2} |S|/\varepsilon)$ under
the known transition setting, where $H$ is the planning horizon, $|S|$ is the
state space size and $\varepsilon$ is the desired policy value gap. This improves
upon the previous best bound of $\tilde{O}(H^2 |S| / \varepsilon^2)$ for AIL
methods and matches the lower bound of $\tilde{\Omega} (H^{3/2}
|S|/\varepsilon)$ in [Rajaraman et al., 2021] up to a logarithmic factor. The
key ingredient of TAIL is a fine-grained estimator for expert state-action
distribution, which explicitly utilizes the transition function information.
Secondly, considering practical settings where the transition functions are
usually unknown but environment interaction is allowed, we accordingly develop
a model-based transition-aware AIL algorithm named MB-TAIL. In particular,
MB-TAIL builds an empirical transition model by interacting with the
environment and performs imitation under the recovered empirical model. The
interaction complexity of MB-TAIL is $\tilde{O} (H^3 |S|^2 |A| /
\varepsilon^2)$, which improves the best known result of $\tilde{O} (H^4 |S|^2
|A| / \varepsilon^2)$ in [Shani et al., 2021]. Finally, our theoretical results
are supported by numerical evaluation and detailed analysis on two challenging
MDPs.
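The model-based step the abstract attributes to MB-TAIL (build an empirical transition model from environment interaction, then imitate under it) can be illustrated with a minimal tabular sketch. This is a generic illustration under assumed names, not the paper's algorithm: `sample_next_state` is a hypothetical environment hook, and uniform per-pair sampling is a simplification of whatever interaction scheme MB-TAIL actually uses.

```python
import numpy as np

def estimate_transitions(sample_next_state, n_states, n_actions, n_samples_per_pair):
    """Estimate an empirical transition kernel P_hat[s, a, s'] from
    environment interaction (generic tabular sketch; `sample_next_state(s, a)`
    is a hypothetical hook returning one next state from the true,
    unknown transition kernel)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples_per_pair):
                counts[s, a, sample_next_state(s, a)] += 1
    # Normalize visit counts into a row-stochastic empirical kernel.
    return counts / counts.sum(axis=2, keepdims=True)
```

With enough samples per pair, each row of the recovered kernel concentrates around the true next-state distribution, which is what makes imitation under the empirical model meaningful.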
Related papers
- Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation [8.274693573069442]
We study reinforcement learning with multinomial logistic (MNL) function approximation.
We propose provably efficient algorithms with randomized exploration having frequentist regret guarantees.
Numerical experiments demonstrate the superior performance of the proposed algorithms.
arXiv Detail & Related papers (2024-05-30T15:39:19Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Provably Efficient Adversarial Imitation Learning with Unknown Transitions [24.70187647541753]
Imitation learning (IL) has proven to be an effective method for learning good policies from expert demonstrations.
This paper explores the theoretical underpinnings of AIL in the presence of unknown transitions.
We propose an algorithm, MB-TAIL, that achieves the minimax optimal expert sample complexity of $\widetilde{O}(H^{3/2} |S|/\varepsilon)$ and interaction complexity of $\widetilde{O}(H^3 |S|^2 |A|/\varepsilon^2)$.
arXiv Detail & Related papers (2023-06-11T02:46:41Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
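The planning phase described above (a reward function is handed to the agent, which then plans under a model learned during exploration) amounts to finite-horizon dynamic programming. A minimal sketch follows; the function name and the assumption that the model is a tabular kernel `P_hat` are illustrative, not the UCRL-RFE algorithm itself.

```python
import numpy as np

def plan_with_model(P_hat, R, horizon):
    """Finite-horizon value iteration under an estimated model.

    P_hat: (S, A, S) estimated transition kernel
    R:     (S, A) reward given at planning time
    Returns a step-indexed greedy policy and the step-0 values."""
    n_states, n_actions, _ = P_hat.shape
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P_hat @ V          # Bellman backup: (S, A) action values at step h
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

Because the reward enters only here, the same exploration data can be reused to plan for any reward function, which is the point of the reward-free setting.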
arXiv Detail & Related papers (2021-10-12T23:03:58Z)
- Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.