Minimax Off-Policy Evaluation for Multi-Armed Bandits
- URL: http://arxiv.org/abs/2101.07781v1
- Date: Tue, 19 Jan 2021 18:55:29 GMT
- Title: Minimax Off-Policy Evaluation for Multi-Armed Bandits
- Authors: Cong Ma, Banghua Zhu, Jiantao Jiao, Martin J. Wainwright
- Abstract summary: We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
- Score: 58.7013651350436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of off-policy evaluation in the multi-armed bandit model
with bounded rewards, and develop minimax rate-optimal procedures under three
settings. First, when the behavior policy is known, we show that the Switch
estimator, a method that alternates between the plug-in and importance sampling
estimators, is minimax rate-optimal for all sample sizes. Second, when the
behavior policy is unknown, we analyze performance in terms of the competitive
ratio, thereby revealing a fundamental gap between the settings of known and
unknown behavior policies. When the behavior policy is unknown, any estimator
must have mean-squared error larger -- relative to the oracle estimator
equipped with the knowledge of the behavior policy -- by a multiplicative
factor proportional to the support size of the target policy. Moreover, we
demonstrate that the plug-in approach achieves this worst-case competitive
ratio up to a logarithmic factor. Third, we initiate the study of the partial
knowledge setting in which it is assumed that the minimum probability taken by
the behavior policy is known. We show that the plug-in estimator is optimal for
relatively large values of the minimum probability, but is sub-optimal when the
minimum probability is low. In order to remedy this gap, we propose a new
estimator based on approximation by Chebyshev polynomials that provably
achieves the optimal estimation error. Numerical experiments on both simulated
and real data corroborate our theoretical findings.
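To make the Switch idea concrete, the sketch below is a minimal, illustrative reading of the abstract for a K-armed bandit with rewards in [0, 1]; it is not the authors' implementation. It uses importance sampling on arms whose importance ratio pi(a)/mu(a) falls below a threshold tau and falls back to the plug-in (empirical-mean) estimate on the remaining arms. The function name, the thresholding rule, and the toy data are assumptions made for illustration.

```python
import numpy as np

def switch_estimate(actions, rewards, mu, pi, tau):
    """Illustrative Switch-style off-policy value estimate for a K-armed bandit.

    actions : (n,) int array of arms drawn from the behavior policy mu
    rewards : (n,) float array of observed rewards in [0, 1]
    mu, pi  : (K,) arrays of behavior / target action probabilities
    tau     : threshold on the importance ratio pi[a] / mu[a]
    """
    actions, rewards = np.asarray(actions), np.asarray(rewards)
    n, K = len(actions), len(pi)
    ratio = np.asarray(pi) / np.maximum(np.asarray(mu), 1e-12)  # importance ratios per arm
    value = 0.0
    for a in range(K):
        pulls = rewards[actions == a]
        if ratio[a] <= tau:
            # Small importance ratio: the importance-sampling term is low-variance.
            value += ratio[a] * pulls.sum() / n
        else:
            # Large ratio: fall back on the plug-in term pi(a) * (empirical mean of arm a).
            value += pi[a] * (pulls.mean() if len(pulls) else 0.0)
    return value

# Hypothetical usage: the target policy favors an arm the behavior policy rarely plays.
rng = np.random.default_rng(0)
mu, pi = np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.1, 0.8])
actions = rng.choice(3, size=2000, p=mu)
rewards = rng.uniform(size=2000) * (actions == 2)   # only arm 2 pays off
print(switch_estimate(actions, rewards, mu, pi, tau=3.0))
```

Arms that the target policy weights heavily but the behavior policy rarely plays are exactly where the importance weights blow up; routing those arms to the plug-in term is, roughly, the intuition behind the minimax-optimality claim.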
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), which enables the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We instead look for the best behavior policy from which to collect samples so as to reduce the variance of the policy gradient estimate.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy [11.16777821381608]
We introduce DRUnknown, a novel doubly-robust (DR) off-policy estimator for Markov decision processes designed for situations where both the logging policy and the value function are unknown.
The proposed estimator first estimates the logging policy and then fits the value-function model by minimizing the variance of the estimator while accounting for the effect of the logging-policy estimate (a minimal doubly-robust sketch, not specific to this paper, appears after this list).
arXiv Detail & Related papers (2024-04-02T10:42:44Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Error Reduction from Stacked Regressions [12.657895453939298]
Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy.
In this paper, we learn these combination weights by minimizing a regularized version of the empirical risk subject to a nonnegativity constraint.
Thanks to an adaptive shrinkage effect, the resulting stacked estimator has strictly smaller population risk than the best single estimator among them.
arXiv Detail & Related papers (2023-09-18T15:42:12Z)
- Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over the pseudoinverse (PI) estimator grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
arXiv Detail & Related papers (2021-01-05T20:07:56Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result shows that sparsity-aware methods can make batch RL more sample-efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
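Several entries above (DRUnknown and the deterministic-policy estimators) build on the same doubly-robust template: a model-based (direct) term plus an importance-weighted correction computed with an estimated logging policy. The sketch below is a minimal contextual-bandit version under assumed interfaces (pi, mu_hat, and q_hat are callables supplied by the user); it is not taken from any of the cited papers.

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards, pi, mu_hat, q_hat):
    """Minimal doubly-robust off-policy value estimate for contextual bandits.

    pi(x)     -> (K,) target-policy action probabilities for context x
    mu_hat(x) -> (K,) estimated logging-policy probabilities for context x
    q_hat(x)  -> (K,) estimated mean rewards for each action in context x
    """
    total = 0.0
    for x, a, r in zip(contexts, actions, rewards):
        p, m, q = pi(x), mu_hat(x), q_hat(x)
        direct = float(np.dot(p, q))          # model-based (direct method) term
        w = p[a] / max(m[a], 1e-12)           # importance weight with estimated logging policy
        total += direct + w * (r - q[a])      # importance-weighted correction of the model's bias
    return total / len(rewards)
```

The estimate stays unbiased if either the logging-policy model or the reward model is exact, which is the "doubly robust" property the papers above refine, for instance by choosing the models to minimize the estimator's variance.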