Provably Breaking the Quadratic Error Compounding Barrier in Imitation
  Learning, Optimally
        - URL: http://arxiv.org/abs/2102.12948v1
- Date: Thu, 25 Feb 2021 15:50:19 GMT
- Title: Provably Breaking the Quadratic Error Compounding Barrier in Imitation
  Learning, Optimally
- Authors: Nived Rajaraman, Yanjun Han, Lin F. Yang, Kannan Ramchandran, Jiantao
  Jiao
- Abstract summary: We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $mathcalS$.
We establish an upper bound $O(|mathcalS|H3/2/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al ( 2020)
We show the minimax suboptimality grows as $Omega( H3/2/N)$ when $mathcalS|geq 3$ while the unknown-transition setting suffers from a larger sharp rate
- Score: 58.463668865380946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   We study the statistical limits of Imitation Learning (IL) in episodic Markov
Decision Processes (MDPs) with a state space $\mathcal{S}$. We focus on the
known-transition setting where the learner is provided a dataset of $N$
length-$H$ trajectories from a deterministic expert policy and knows the MDP
transition. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the
suboptimality using the Mimic-MD algorithm in Rajaraman et al (2020) which we
prove to be computationally efficient. In contrast, we show the minimax
suboptimality grows as $\Omega( H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$ while
the unknown-transition setting suffers from a larger sharp rate
$\Theta(|\mathcal{S}|H^2/N)$ (Rajaraman et al (2020)). The lower bound is
established by proving a two-way reduction between IL and the value estimation
problem of the unknown expert policy under any given reward function, as well
as building connections with linear functional estimation with subsampled
observations. We further show that under the additional assumption that the
expert is optimal for the true reward function, there exists an efficient
algorithm, which we term as Mimic-Mixture, that provably achieves suboptimality
$O(1/N)$ for arbitrary 3-state MDPs with rewards only at the terminal layer. In
contrast, no algorithm can achieve suboptimality $O(\sqrt{H}/N)$ with high
probability if the expert is not constrained to be optimal. Our work formally
establishes the benefit of the expert optimal assumption in the known
transition setting, while Rajaraman et al (2020) showed it does not help when
transitions are unknown.
 
      
        Related papers
        - Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
 We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences.<n>Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms.<n>We develop algorithms that achieve a regret bound of order $T2/3$ under known transitions.
 arXiv  Detail & Related papers  (2025-07-15T20:19:32Z)
- Randomized Exploration for Reinforcement Learning with Multinomial   Logistic Function Approximation [8.274693573069442]
 We study reinforcement learning with multinomial logistic (MNL) function approximation.
We propose provably efficient algorithms with randomized exploration having frequentist regret guarantees.
 Numerical experiments demonstrate the superior performance of the proposed algorithms.
 arXiv  Detail & Related papers  (2024-05-30T15:39:19Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
  Feedback and Unknown Transition [71.33787410075577]
 We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $widetildeO(dsqrtHS3K + sqrtHSAK)$ regret with high probability.
 arXiv  Detail & Related papers  (2024-03-07T15:03:50Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
  Shortest Path [80.60592344361073]
 We study the Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
 arXiv  Detail & Related papers  (2024-02-14T07:52:00Z)
- Near Sample-Optimal Reduction-based Policy Learning for Average Reward
  MDP [58.13930707612128]
 This work considers the sample complexity of obtaining an $varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP)
We prove an upper bound of $widetilde O(H varepsilon-3 ln frac1delta)$ samples per state-action pair, where $H := sp(h*)$ is the span of bias of any optimal policy, $varepsilon$ is the accuracy and $delta$ is the failure probability.
 arXiv  Detail & Related papers  (2022-12-01T15:57:58Z)
- Optimism in Face of a Context: Regret Guarantees for Stochastic
  Contextual MDP [46.86114958340962]
 We present regret algorithms for contextual MDPs under minimum reachability assumption.
Our approach is the first optimistic approach applied to contextual MDPs with general function approximation.
 arXiv  Detail & Related papers  (2022-07-22T15:00:15Z)
- Nearly Minimax Optimal Reinforcement Learning with Linear Function
  Approximation [25.60689712525918]
 We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel-efficient algorithm, LSVI-UCB$+$, which achieves an $widetildeO(HdsqrtT)$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
 arXiv  Detail & Related papers  (2022-06-23T06:04:21Z)
- Computationally Efficient Horizon-Free Reinforcement Learning for Linear
  Mixture MDPs [111.75736569611159]
 We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least square estimator for the unknown transitional dynamic.
This also improves upon the best-known algorithms in this setting when $sigma_k2$'s are known.
 arXiv  Detail & Related papers  (2022-05-23T17:59:18Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
  Approximation [92.3161051419884]
 We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $tildeO(dHsqrtT)$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $tildeOmega(dHsqrtT)$ up to logarithmic factors.
 arXiv  Detail & Related papers  (2021-02-17T18:54:08Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
  Decision Processes [91.38793800392108]
 We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $textUCRL-VTR+$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
 arXiv  Detail & Related papers  (2020-12-15T18:56:46Z)
- Toward the Fundamental Limits of Imitation Learning [29.87139380803829]
 Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations.
We first consider the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time, and cannot interact with the MDP.
We show that the policy which mimics the expert whenever possible is in expectation $lesssim frac|mathcalS| H2 log (N)N$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary policy.
 arXiv  Detail & Related papers  (2020-09-13T12:45:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
           This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.