Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy
- URL: http://arxiv.org/abs/2010.13554v1
- Date: Fri, 23 Oct 2020 15:22:57 GMT
- Title: Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy
- Authors: Masahiro Kato and Yusuke Kaneko
- Abstract summary: The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
- Score: 8.807587076209566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of off-policy evaluation (OPE) is to evaluate a new policy using
historical data obtained via a behavior policy. However, because the contextual
bandit algorithm updates the policy based on past observations, the samples are
not independent and identically distributed (i.i.d.). This paper tackles this
problem by constructing an estimator from a martingale difference sequence
(MDS) for the dependent samples. In the data-generating process, we do not
assume that the policy converges; we assume only that the policy uses the same
conditional probability of choosing an action within each period (batch). Under
this condition, we derive an asymptotically normal estimator of the value of an
evaluation policy. As a further advantage, the batch-based approach
simultaneously resolves the deficient support problem, in which the behavior
policy assigns zero probability to some actions that the evaluation policy
would choose. Using benchmark and real-world datasets, we experimentally
confirm the effectiveness of the proposed method.
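The construction admits a compact sketch. Assuming logs grouped into batches within which the behavior policy's action probabilities are fixed, a minimal batch-wise importance-weighting estimator might look like the following; `batch_ipw_estimate`, `evaluation_policy`, and `behavior_prob` are hypothetical names, and this is a generic sketch rather than the authors' exact estimator.

```python
import numpy as np

def batch_ipw_estimate(batches, evaluation_policy):
    """Batch-wise importance-weighted value estimate (illustrative sketch).

    `batches` is a list of batches; within each batch the behavior policy
    is fixed, matching the batch update setting. Each record is a
    (context, action, reward, behavior_prob) tuple.
    """
    terms = []
    for batch in batches:
        for context, action, reward, behavior_prob in batch:
            # Importance weight pi_e(a | x) / pi_b(a | x); pi_b is constant
            # within the batch and depends only on earlier observations.
            w = evaluation_policy(action, context) / behavior_prob
            terms.append(w * reward)
    terms = np.asarray(terms)
    value = terms.mean()
    # Heuristic plug-in standard error, motivated by the martingale CLT.
    stderr = terms.std(ddof=1) / np.sqrt(len(terms))
    return value, stderr
```

Because each weight depends only on observations from earlier batches, the centered terms form a martingale difference sequence, which is the property behind the asymptotic normality stated in the abstract.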
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
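To make the re-weighting in the entry above concrete, here is a minimal REINFORCE-style sketch of an importance-weighted policy gradient over samples collected by a behavior policy; `target_prob`, `behavior_prob`, and `log_prob_grad` are hypothetical callables, and the active choice of behavior policy studied in the paper is not implemented here.

```python
import numpy as np

def is_policy_gradient(samples, target_prob, behavior_prob, log_prob_grad):
    """Importance-weighted REINFORCE-style gradient (illustrative sketch).

    samples: iterable of (state, action, return_) tuples collected under
    the behavior policy; the ratio pi_theta / pi_b re-targets each sample
    to the current policy.
    """
    grads = []
    for state, action, return_ in samples:
        w = target_prob(state, action) / behavior_prob(state, action)
        grads.append(w * return_ * log_prob_grad(state, action))
    # The variance of this average depends on the behavior policy, which
    # is what motivates choosing it actively rather than passively.
    return np.mean(grads, axis=0)
```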
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor reparametric procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
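A rough rendering of the nearest-neighbor idea above, assuming a deterministic target policy, states and actions stored as 2-D arrays, and scikit-learn's tree-based neighbor search; `knn_ope_value` and its signature are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_ope_value(states, actions, rewards, policy, k=5):
    """Illustrative K-nearest-neighbor OPE (sketch, not the paper's code).

    For each logged state, find the k historical (state, action) pairs
    closest to (state, policy(state)) and average their rewards. No
    optimization step and no parametric dynamics model is needed.
    """
    queries = np.hstack([states, np.array([policy(s) for s in states])])
    keys = np.hstack([states, actions])
    neighbors = NearestNeighbors(n_neighbors=k).fit(keys)  # tree-based search
    _, idx = neighbors.kneighbors(queries)
    return rewards[idx].mean()
```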
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
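A weighted-quantile interval gives the flavor of the interval estimate described above: importance weights correct the shift from behavior to target policy, and the interval endpoints are weighted empirical quantiles. This is a loose sketch of the general recipe, not the paper's MDP procedure; all names are hypothetical.

```python
import numpy as np

def weighted_quantile_interval(rewards, weights, alpha=0.1):
    """Illustrative weighted interval for the target policy's reward.

    `weights` are importance ratios pi_target / pi_behavior; the interval
    covers roughly a 1 - alpha fraction of the reweighted reward mass.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(rewards)
    r, w = rewards[order], weights[order]
    cdf = np.cumsum(w)
    cdf /= cdf[-1]  # weighted empirical CDF over the sorted rewards
    lo = r[np.searchsorted(cdf, alpha / 2)]
    hi = r[min(np.searchsorted(cdf, 1 - alpha / 2), len(r) - 1)]
    return lo, hi
```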
- Sayer: Using Implicit Feedback to Optimize System Policies [63.992191765269396]
We develop a methodology that leverages implicit feedback to evaluate and train new system policies.
Sayer builds on two ideas from reinforcement learning to leverage data collected by an existing policy.
We show that Sayer can evaluate arbitrary policies accurately, and train new policies that outperform the production policies.
arXiv Detail & Related papers (2021-10-28T04:16:56Z)
- Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales [8.807587076209566]
The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm.
Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.).
Several existing methods for OPE do not take this issue into account and are based on the assumption that samples are i.i.d.
arXiv Detail & Related papers (2020-06-12T07:48:04Z)
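The title of the entry above suggests the following shape, offered here only as a generic sketch: per-round importance-weighted rewards are treated as a martingale sequence, and a martingale central limit theorem yields a normal-approximation interval. `martingale_ci` and its inputs are hypothetical.

```python
import numpy as np
from scipy import stats

def martingale_ci(weighted_rewards, alpha=0.05):
    """Normal-approximation CI from a martingale-based OPE estimate (sketch).

    `weighted_rewards` are per-round importance-weighted rewards; by a
    martingale CLT their standardized mean is approximately N(0, 1).
    """
    x = np.asarray(weighted_rewards, dtype=float)
    mean = x.mean()
    stderr = x.std(ddof=1) / np.sqrt(len(x))
    z = stats.norm.ppf(1 - alpha / 2)
    return mean - z * stderr, mean + z * stderr
```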
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
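Distributional robustness of the kind described above is often computed through a dual formulation. The sketch below evaluates the worst-case mean reward over a KL ball around the empirical distribution using the standard KL-DRO dual; it is a generic illustration under that assumption, not necessarily the estimator in the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def kl_robust_value(rewards, delta):
    """Worst-case mean reward over {Q : KL(Q || P_hat) <= delta} (sketch).

    Uses the standard dual:
        inf_Q E_Q[r] = sup_{a > 0} -a * log E_P[exp(-r / a)] - a * delta
    """
    r = np.asarray(rewards, dtype=float)
    n = len(r)

    def neg_dual(a):
        # a * log E[exp(-r/a)] + a * delta, computed stably via logsumexp.
        return a * (logsumexp(-r / a) - np.log(n)) + a * delta

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun
```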
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
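The failure mode named above and one standard fix can be sketched directly. For a deterministic target policy, pi_e(a | x) is a point mass and the importance ratio is undefined; replacing the point mass with a Gaussian kernel around the target action restores a usable weight. The code below assumes scalar continuous actions and hypothetical names, and shows plain kernel-smoothed importance sampling rather than the paper's doubly robust estimators.

```python
import numpy as np

def kernel_ipw_deterministic(contexts, actions, rewards, policy,
                             behavior_density, bandwidth=0.5):
    """Kernel-smoothed importance sampling for a deterministic target
    policy with scalar continuous actions (illustrative sketch).
    """
    target = np.array([policy(x) for x in contexts])
    u = (np.asarray(actions, dtype=float) - target) / bandwidth
    # Gaussian kernel concentrates weight on logged actions near pi_e(x).
    kernel = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
    weights = kernel / behavior_density(contexts, actions)
    return np.mean(weights * np.asarray(rewards, dtype=float))
```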
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)