On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function
Estimation in Off-policy Evaluation
- URL: http://arxiv.org/abs/2201.06169v1
- Date: Mon, 17 Jan 2022 01:09:38 GMT
- Authors: Xiaohong Chen, Zhengling Qi
- Abstract summary: We study the off-policy evaluation problem in an infinite-horizon Markov decision process with continuous states and actions.
We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem.
- Score: 1.575865518040625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the off-policy evaluation (OPE) problem in an infinite-horizon
Markov decision process with continuous states and actions. We recast the
$Q$-function estimation into a special form of the nonparametric instrumental
variables (NPIV) estimation problem. We first show that under one mild
condition the NPIV formulation of $Q$-function estimation is well-posed in the
sense of $L^2$-measure of ill-posedness with respect to the data generating
distribution, bypassing a strong assumption on the discount factor $\gamma$
imposed in the recent literature for obtaining the $L^2$ convergence rates of
various $Q$-function estimators. Thanks to this new well-posed property, we
derive the first minimax lower bounds for the convergence rates of
nonparametric estimation of $Q$-function and its derivatives in both sup-norm
and $L^2$-norm, which are shown to be the same as those for the classical
nonparametric regression (Stone, 1982). We then propose a sieve two-stage least
squares estimator and establish its rate-optimality in both norms under some
mild conditions. Our general results on the well-posedness and the minimax
lower bounds are of independent interest to study not only other nonparametric
estimators for $Q$-function but also efficient estimation on the value of any
target policy in off-policy settings.
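The NPIV recasting described in the abstract can be sketched as a conditional moment restriction. The display below is a reconstruction in standard OPE notation, not taken verbatim from the paper (the symbols $Q^{\pi}$, $\gamma$, $\pi$, and the state-action variables are assumptions):

\[
\mathbb{E}\!\left[\, R_t + \gamma \int_{\mathcal{A}} Q^{\pi}(S_{t+1}, a')\, \pi(a' \mid S_{t+1})\, da' \;-\; Q^{\pi}(S_t, A_t) \;\middle|\; S_t, A_t \right] = 0 .
\]

This has the form of an NPIV conditional moment equation $\mathbb{E}[\rho(D; h) \mid W] = 0$, with the current state-action pair $(S_t, A_t)$ playing the role of the instrument $W$ and the unknown function $Q^{\pi}$ entering both at $(S_t, A_t)$ and, through the transition, at $(S_{t+1}, \cdot)$.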
Related papers
- Multivariate root-n-consistent smoothing parameter free matching estimators and estimators of inverse density weighted expectations [51.000851088730684]
We develop novel modifications of nearest-neighbor and matching estimators which converge at the parametric $\sqrt{n}$-rate.
We stress that our estimators do not involve nonparametric function estimators and, in particular, do not rely on sample-size-dependent smoothing parameters.
arXiv Detail & Related papers (2024-07-11T13:28:34Z)
- Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of i.i.d. observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z)
- Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness [71.42652863687117]
We study nonparametric estimation of instrumental variable (IV) regressions.
We propose a new penalized minimax estimator that can converge to a fixed IV solution.
We derive a strong $L_2$ error rate for our estimator under lax conditions.
arXiv Detail & Related papers (2023-02-10T18:08:49Z)
- Policy evaluation from a single path: Multi-step methods, mixing and mis-specification [45.88067550131531]
We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process.
We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference estimates.
arXiv Detail & Related papers (2022-11-07T23:15:25Z)
- Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks [71.95722100511627]
We consider the off-policy evaluation problem of reinforcement learning using deep neural networks.
We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2022-06-06T20:25:20Z)
- Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z)
- Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that, given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z)
- Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator which does not require $\nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
arXiv Detail & Related papers (2020-10-24T10:44:02Z) - Estimation in Tensor Ising Models [5.161531917413708]
We consider the problem of estimating the natural parameter of the $p$-tensor Ising model given a single sample from the distribution on $N$ nodes.
In particular, we show the $\sqrt{N}$-consistency of the MPL estimate in the $p$-spin Sherrington-Kirkpatrick (SK) model.
We derive the precise fluctuations of the MPL estimate in the special case of the $p$-tensor Curie-Weiss model.
arXiv Detail & Related papers (2020-08-29T00:06:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.