Toward the Fundamental Limits of Imitation Learning
- URL: http://arxiv.org/abs/2009.05990v1
- Date: Sun, 13 Sep 2020 12:45:52 GMT
- Title: Toward the Fundamental Limits of Imitation Learning
- Authors: Nived Rajaraman, Lin F. Yang, Jiantao Jiao, Kannan Ramachandran
- Abstract summary: Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations.
We first consider the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time, and cannot interact with the MDP.
We show that the policy which mimics the expert whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log(N)}{N}$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary stochastic policy.
- Score: 29.87139380803829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning (IL) aims to mimic the behavior of an expert policy in a
sequential decision-making problem given only demonstrations. In this paper, we
focus on understanding the minimax statistical limits of IL in episodic Markov
Decision Processes (MDPs). We first consider the setting where the learner is
provided a dataset of $N$ expert trajectories ahead of time, and cannot
interact with the MDP. Here, we show that the policy which mimics the expert
whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log
(N)}{N}$ suboptimal compared to the value of the expert, even when the expert
follows an arbitrary stochastic policy. Here $\mathcal{S}$ is the state space,
and $H$ is the length of the episode. Furthermore, we establish a suboptimality
lower bound of $\gtrsim |\mathcal{S}| H^2 / N$ which applies even if the expert
is constrained to be deterministic, or if the learner is allowed to actively
query the expert at visited states while interacting with the MDP for $N$
episodes. To our knowledge, this is the first algorithm with suboptimality
having no dependence on the number of actions, under no additional assumptions.
We then propose a novel algorithm based on minimum-distance functionals in the
setting where the transition model is given and the expert is deterministic.
The algorithm is suboptimal by $\lesssim \min \{ H \sqrt{|\mathcal{S}| / N} ,\
|\mathcal{S}| H^{3/2} / N \}$, showing that knowledge of transition improves
the minimax rate by at least a $\sqrt{H}$ factor.
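For concreteness, below is a minimal, illustrative Python sketch (not the authors' code) of the no-interaction policy analyzed above: behavior cloning that mimics the expert whenever possible on (timestep, state) pairs covered by the $N$ demonstration trajectories and plays an arbitrary action elsewhere. All function and variable names are assumptions made for illustration.
```python
# Minimal illustrative sketch (not the authors' code) of the policy analyzed in
# the abstract: "mimic the expert whenever possible" from N expert trajectories
# in an episodic MDP with horizon H, falling back to an arbitrary action at
# (timestep, state) pairs never visited by the expert. Names are assumptions.
import random
from collections import Counter, defaultdict

def fit_mimicking_policy(trajectories, num_actions):
    """trajectories: list of episodes, each a length-H list of (state, action) pairs."""
    counts = defaultdict(Counter)  # (t, state) -> empirical expert action counts
    for episode in trajectories:
        for t, (state, action) in enumerate(episode):
            counts[(t, state)][action] += 1

    def policy(t, state):
        c = counts.get((t, state))
        if c is None:
            # State unseen in the demonstrations at step t: play an arbitrary action.
            return random.randrange(num_actions)
        # Seen state: match the empirical expert action distribution
        # (for a deterministic expert this reduces to replaying its action).
        actions, freqs = zip(*c.items())
        return random.choices(actions, weights=freqs)[0]

    return policy

# Toy usage: two length-3 demonstrations over integer states and actions.
demos = [[(0, 1), (2, 0), (3, 1)], [(0, 1), (1, 0), (3, 1)]]
pi = fit_mimicking_policy(demos, num_actions=2)
print(pi(0, 0), pi(1, 5))  # expert action at (t=0, s=0); arbitrary at unseen (t=1, s=5)
```
Intuitively, the suboptimality of such a policy is driven by the chance of reaching a state the demonstrations never cover, after which the error can compound for the rest of the episode, which is the source of the $H^2$ factor in the bound above.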
Related papers
- A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret-minimization algorithm for the episodic sparse linear Markov decision process (SMDP).
The regret of the proposed algorithm is $\tilde{O}(\sigma_{\min}^{-1} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d + \log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Optimism in Face of a Context: Regret Guarantees for Stochastic
Contextual MDP [46.86114958340962]
We present regret-minimization algorithms for contextual MDPs under a minimum reachability assumption.
Our approach is the first optimistic approach applied to contextual MDPs with general function approximation.
arXiv Detail & Related papers (2022-07-22T15:00:15Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3 S A / \rho + H^4 S^2 A)\big)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho > \Theta(1/(HS))$ and $H > 1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state with minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}| H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show that the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions, with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state with minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)