Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally
- URL: http://arxiv.org/abs/2102.12948v1
- Date: Thu, 25 Feb 2021 15:50:19 GMT
- Title: Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally
- Authors: Nived Rajaraman, Yanjun Han, Lin F. Yang, Kannan Ramchandran, Jiantao
Jiao
- Abstract summary: We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al (2020). We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the statistical limits of Imitation Learning (IL) in episodic Markov
Decision Processes (MDPs) with a state space $\mathcal{S}$. We focus on the
known-transition setting where the learner is provided a dataset of $N$
length-$H$ trajectories from a deterministic expert policy and knows the MDP
transition. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the
suboptimality using the Mimic-MD algorithm in Rajaraman et al (2020) which we
prove to be computationally efficient. In contrast, we show the minimax
suboptimality grows as $\Omega( H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$ while
the unknown-transition setting suffers from a larger sharp rate
$\Theta(|\mathcal{S}|H^2/N)$ (Rajaraman et al (2020)). The lower bound is
established by proving a two-way reduction between IL and the value estimation
problem of the unknown expert policy under any given reward function, as well
as building connections with linear functional estimation with subsampled
observations. We further show that under the additional assumption that the
expert is optimal for the true reward function, there exists an efficient
algorithm, which we term Mimic-Mixture, that provably achieves suboptimality
$O(1/N)$ for arbitrary 3-state MDPs with rewards only at the terminal layer. In
contrast, no algorithm can achieve suboptimality $O(\sqrt{H}/N)$ with high
probability if the expert is not constrained to be optimal. Our work formally
establishes the benefit of the expert optimality assumption in the
known-transition setting, while Rajaraman et al (2020) showed it does not help
when transitions are unknown.
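
For reference, the rates stated in the abstract can be collected side by side. The tabulation below is a paraphrase, writing $J(\pi)$ for the expected return of a policy $\pi$ (notation not used in the abstract itself); "suboptimality" means $J(\pi^{\mathrm{E}}) - \mathbb{E}[J(\hat{\pi})]$ for the expert policy $\pi^{\mathrm{E}}$ and the learned policy $\hat{\pi}$:

```latex
\begin{align*}
&\text{Known transitions, Mimic-MD (upper bound):}
  && O\!\left(\frac{|\mathcal{S}|\,H^{3/2}}{N}\right) \\
&\text{Known transitions, minimax ($|\mathcal{S}|\ge 3$, lower bound):}
  && \Omega\!\left(\frac{H^{3/2}}{N}\right) \\
&\text{Unknown transitions (sharp rate, Rajaraman et al (2020)):}
  && \Theta\!\left(\frac{|\mathcal{S}|\,H^{2}}{N}\right) \\
&\text{Known transitions, optimal expert, Mimic-Mixture} \\
&\quad\text{(3-state MDPs, rewards only at the terminal layer):}
  && O\!\left(\frac{1}{N}\right)
\end{align*}
```

The contrast between the second and third rows is the "quadratic barrier" of the title: knowing the transitions improves the $H$-dependence from $H^{2}$ to $H^{3/2}$, and the expert-optimality assumption (last row) removes the $H$-dependence entirely in the stated 3-state setting.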