Minimax Optimal Online Imitation Learning via Replay Estimation
- URL: http://arxiv.org/abs/2205.15397v2
- Date: Thu, 2 Jun 2022 15:44:22 GMT
- Title: Minimax Optimal Online Imitation Learning via Replay Estimation
- Authors: Gokul Swamy, Nived Rajaraman, Matthew Peng, Sanjiban Choudhury, J.
Andrew Bagnell, Zhiwei Steven Wu, Jiantao Jiao, Kannan Ramchandran
- Abstract summary: We introduce a technique of replay estimation to reduce this empirical variance.
We show that our approach achieves the optimal $\widetilde{O}\left( \min(H^{3/2} / N, H / \sqrt{N}) \right)$ dependency.
- Score: 47.83919594113314
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Online imitation learning is the problem of how best to mimic expert
demonstrations, given access to the environment or an accurate simulator. Prior
work has shown that in the infinite sample regime, exact moment matching
achieves value equivalence to the expert policy. However, in the finite sample
regime, even if one has no optimization error, empirical variance can lead to a
performance gap that scales with $H^2 / N$ for behavioral cloning and $H /
\sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the
size of the expert dataset. We introduce the technique of replay estimation to
reduce this empirical variance: by repeatedly executing cached expert actions
in a stochastic simulator, we compute a smoother expert visitation distribution
estimate to match. In the presence of general function approximation, we prove
a meta theorem reducing the performance gap of our approach to the parameter
estimation error for offline classification (i.e. learning the expert policy).
In the tabular setting or with linear function approximation, our meta theorem
shows that the performance gap incurred by our approach achieves the optimal
$\widetilde{O} \left( \min({H^{3/2}} / {N}, {H} / {\sqrt{N}}) \right)$
dependency, under significantly weaker assumptions compared to prior work. We
implement multiple instantiations of our approach on several continuous control
tasks and find that we are able to significantly improve policy performance
across a variety of dataset sizes.
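The replay step described above is easy to state concretely. The sketch below is a minimal tabular illustration of the idea as summarized in the abstract, not the paper's full algorithm (which also handles general function approximation); the gymnasium-style simulator interface and the names `env`, `expert_action_seqs`, and `n_replays` are assumptions for illustration.
```python
from collections import defaultdict

def replay_visitation_estimate(env, expert_action_seqs, n_replays=32):
    """Smoothed estimate of the expert's (timestep, state, action) visitation
    distribution: re-execute each cached expert action sequence many times in
    the stochastic simulator and average over the transition noise."""
    counts = defaultdict(float)
    total = 0
    for actions in expert_action_seqs:
        for _ in range(n_replays):
            state, _ = env.reset()  # gymnasium-style API (assumed)
            for h, a in enumerate(actions):
                counts[(h, state, a)] += 1.0  # tabular: states must be hashable
                total += 1
                state, _, terminated, truncated, _ = env.step(a)
                if terminated or truncated:
                    break
    # Normalize into an empirical distribution for the learner to moment-match.
    return {key: c / total for key, c in counts.items()}
```
The imitation learner is then trained to match this replay-smoothed distribution rather than the raw empirical counts from the $N$ expert trajectories, which is the source of the variance reduction.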
Related papers
- Theoretical limits of descending $\ell_0$ sparse-regression ML algorithms [0.0]
We develop a generic analytical program for studying the performance of maximum-likelihood (ML) decoding.
The key ML performance parameter, the residual root mean square error (RMSE), is uncovered to exhibit the so-called phase-transition (PT) phenomenon.
Concrete implementation and practical relevance of the fl RDT typically rely on the ability to conduct a sizeable set of the underlying numerical evaluations.
arXiv Detail & Related papers (2024-10-10T06:33:41Z)
- Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We propose an algorithm that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$.
The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures.
arXiv Detail & Related papers (2024-07-08T08:06:45Z)
- Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z)
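To make the Pearson-divergence objective concrete, here is a generic stochastic-gradient sketch of uLSIF-style likelihood-ratio estimation from streaming pairs. It is not the authors' OLRE estimator and carries none of its guarantees; the linear feature model, `featurize`, learning rate, and regularizer are illustrative assumptions.
```python
import numpy as np

def online_pearson_lre(stream, featurize, dim, lr=0.05, reg=1e-3):
    """Online likelihood-ratio estimation via Pearson-divergence minimization.

    Model the ratio r(x) ~ p(x)/q(x) as theta . phi(x) and run SGD on the
    uLSIF-style objective
        J(theta) = 0.5 * E_q[r_theta(x)^2] - E_p[r_theta(x)] + 0.5 * reg * ||theta||^2,
    which equals the Pearson-divergence criterion up to an additive constant.
    `stream` yields pairs (x_p, x_q) with x_p ~ p and x_q ~ q at each round.
    """
    theta = np.zeros(dim)
    for x_p, x_q in stream:
        phi_p, phi_q = featurize(x_p), featurize(x_q)
        grad = (theta @ phi_q) * phi_q - phi_p + reg * theta  # unbiased gradient of J
        theta = theta - lr * grad
        yield theta  # current ratio estimate: x -> theta @ featurize(x)
```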
- A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
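The redistribution step in the PARTED summary above has a compact least-squares form under a linear reward model; the sketch below is a generic illustration rather than the paper's exact construction, and `featurize`, the ridge term, and the data layout are assumptions.
```python
import numpy as np

def redistribute_rewards(trajectories, returns, featurize, ridge=1e-3):
    """Least-squares redistribution of trajectory-level returns into per-step
    proxy rewards, using a linear reward model r(s, a) = w . phi(s, a).

    w is chosen so the summed per-step rewards along each trajectory match
    that trajectory's scalar return:
        min_w  sum_i ( sum_t w . phi(s_t^i, a_t^i) - R_i )^2 + ridge * ||w||^2
    """
    # One design row per trajectory: its feature vectors summed over time.
    X = np.stack([
        np.sum([featurize(s, a) for (s, a) in traj], axis=0)
        for traj in trajectories
    ])
    R = np.asarray(returns, dtype=float)
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ R)
    # Per-step proxy reward for any (state, action), ready for a pessimistic
    # value-iteration style offline RL step.
    return lambda s, a: float(w @ featurize(s, a))
```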
- Efficient Minimax Optimal Estimators For Multivariate Convex Regression [1.583842747998493]
We present the first computationally efficient minimax optimal (up to logarithmic factors) estimators for the tasks of (i) $L$-Lipschitz convex regression and (ii) $\Gamma$-bounded convex regression under polytopal support.
This work is the first to show the existence of efficient minimax optimal estimators for non-Donsker classes whose corresponding Least Squares Estimators are provably minimax sub-optimal.
arXiv Detail & Related papers (2022-05-06T17:04:05Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, one can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)