Minimax Off-Policy Evaluation for Multi-Armed Bandits
- URL: http://arxiv.org/abs/2101.07781v1
- Date: Tue, 19 Jan 2021 18:55:29 GMT
- Title: Minimax Off-Policy Evaluation for Multi-Armed Bandits
- Authors: Cong Ma, Banghua Zhu, Jiantao Jiao, Martin J. Wainwright
- Abstract summary: We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
- Score: 58.7013651350436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of off-policy evaluation in the multi-armed bandit model
with bounded rewards, and develop minimax rate-optimal procedures under three
settings. First, when the behavior policy is known, we show that the Switch
estimator, a method that alternates between the plug-in and importance sampling
estimators, is minimax rate-optimal for all sample sizes. Second, when the
behavior policy is unknown, we analyze performance in terms of the competitive
ratio, thereby revealing a fundamental gap between the settings of known and
unknown behavior policies. When the behavior policy is unknown, any estimator
must have mean-squared error larger -- relative to the oracle estimator
equipped with the knowledge of the behavior policy -- by a multiplicative
factor proportional to the support size of the target policy. Moreover, we
demonstrate that the plug-in approach achieves this worst-case competitive
ratio up to a logarithmic factor. Third, we initiate the study of the partial
knowledge setting in which it is assumed that the minimum probability taken by
the behavior policy is known. We show that the plug-in estimator is optimal for
relatively large values of the minimum probability, but is sub-optimal when the
minimum probability is low. In order to remedy this gap, we propose a new
estimator based on approximation by Chebyshev polynomials that provably
achieves the optimal estimation error. Numerical experiments on both simulated
and real data corroborate our theoretical findings.
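To make the Switch idea concrete, the sketch below is a minimal, illustrative reading of the abstract for a K-armed bandit with rewards in [0, 1]; it is not the authors' implementation. It uses importance sampling on arms whose importance ratio pi(a)/mu(a) falls below a threshold tau and falls back to the plug-in (empirical-mean) estimate on the remaining arms. The function name, the thresholding rule, and the toy data are assumptions made for illustration.

```python
import numpy as np

def switch_estimate(actions, rewards, mu, pi, tau):
    """Illustrative Switch-style off-policy value estimate for a K-armed bandit.

    actions : (n,) int array of arms drawn from the behavior policy mu
    rewards : (n,) float array of observed rewards in [0, 1]
    mu, pi  : (K,) arrays of behavior / target action probabilities
    tau     : threshold on the importance ratio pi[a] / mu[a]
    """
    actions, rewards = np.asarray(actions), np.asarray(rewards)
    n, K = len(actions), len(pi)
    ratio = np.asarray(pi) / np.maximum(np.asarray(mu), 1e-12)  # importance ratios per arm
    value = 0.0
    for a in range(K):
        pulls = rewards[actions == a]
        if ratio[a] <= tau:
            # Small importance ratio: the importance-sampling term is low-variance.
            value += ratio[a] * pulls.sum() / n
        else:
            # Large ratio: fall back on the plug-in term pi(a) * (empirical mean of arm a).
            value += pi[a] * (pulls.mean() if len(pulls) else 0.0)
    return value

# Hypothetical usage: the target policy favors an arm the behavior policy rarely plays.
rng = np.random.default_rng(0)
mu, pi = np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.1, 0.8])
actions = rng.choice(3, size=2000, p=mu)
rewards = rng.uniform(size=2000) * (actions == 2)   # only arm 2 pays off
print(switch_estimate(actions, rewards, mu, pi, tau=3.0))
```

Arms that the target policy weights heavily but the behavior policy rarely plays are exactly where the importance weights blow up; routing those arms to the plug-in term is, roughly, the intuition behind the minimax-optimality claim.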
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), which enables the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We instead look for the best behavior policy from which to collect samples so as to reduce the variance of the policy gradient estimate.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy [11.16777821381608]
We introduce DRUnknown, a novel doubly-robust (DR) off-policy estimator for Markov decision processes designed for situations where both the logging policy and the value function are unknown.
The proposed estimator first estimates the logging policy and then fits the value-function model by minimizing the variance of the estimator while accounting for the effect of the logging-policy estimate (a minimal doubly-robust sketch, not specific to this paper, appears after this list).
arXiv Detail & Related papers (2024-04-02T10:42:44Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Error Reduction from Stacked Regressions [12.657895453939298]
Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy.
In this paper, we learn these combination weights by minimizing a regularized version of the empirical risk subject to a nonnegativity constraint.
Thanks to an adaptive shrinkage effect, the resulting stacked estimator has strictly smaller population risk than the best single estimator among them.
arXiv Detail & Related papers (2023-09-18T15:42:12Z)
- Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over the pseudoinverse (PI) estimator grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
arXiv Detail & Related papers (2021-01-05T20:07:56Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result shows that sparsity-aware methods can make batch RL more sample-efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
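Several entries above (DRUnknown and the deterministic-policy estimators) build on the same doubly-robust template: a model-based (direct) term plus an importance-weighted correction computed with an estimated logging policy. The sketch below is a minimal contextual-bandit version under assumed interfaces (pi, mu_hat, and q_hat are callables supplied by the user); it is not taken from any of the cited papers.

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards, pi, mu_hat, q_hat):
    """Minimal doubly-robust off-policy value estimate for contextual bandits.

    pi(x)     -> (K,) target-policy action probabilities for context x
    mu_hat(x) -> (K,) estimated logging-policy probabilities for context x
    q_hat(x)  -> (K,) estimated mean rewards for each action in context x
    """
    total = 0.0
    for x, a, r in zip(contexts, actions, rewards):
        p, m, q = pi(x), mu_hat(x), q_hat(x)
        direct = float(np.dot(p, q))          # model-based (direct method) term
        w = p[a] / max(m[a], 1e-12)           # importance weight with estimated logging policy
        total += direct + w * (r - q[a])      # importance-weighted correction of the model's bias
    return total / len(rewards)
```

The estimate stays unbiased if either the logging-policy model or the reward model is exact, which is the "doubly robust" property the papers above refine, for instance by choosing the models to minimize the estimator's variance.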