A Practical Guide of Off-Policy Evaluation for Bandit Problems
- URL: http://arxiv.org/abs/2010.12470v1
- Date: Fri, 23 Oct 2020 15:11:19 GMT
- Title: A Practical Guide of Off-Policy Evaluation for Bandit Problems
- Authors: Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui
- Abstract summary: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies.
We propose a meta-algorithm based on existing OPE estimators.
We investigate the proposed concepts in experiments using synthetic and open real-world datasets.
- Score: 13.607327477092877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) is the problem of estimating the value of a
target policy from samples obtained via different policies. Recently, applying
OPE methods to bandit problems has garnered attention. To obtain theoretical
guarantees for an estimator of the policy value, OPE methods require various
conditions on the target policy and on the policy used to generate the samples.
However, existing studies have not carefully discussed the practical situations
in which such conditions hold, and a gap between theory and practice remains.
This paper presents new results for bridging that gap. Based on the properties
of the evaluation policy, we categorize OPE situations. Then, among practical
applications, we focus mainly on best policy selection. For this situation, we
propose a meta-algorithm based on existing OPE estimators. We investigate the
proposed concepts in experiments using synthetic and open real-world datasets.
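For concreteness, here is a minimal sketch of the setting the abstract describes: an inverse probability weighting (IPW) estimate of a target policy's value computed from data logged by a different policy, used to rank candidate policies. This is only an illustration under an assumed data layout and function names, not the meta-algorithm proposed in the paper.

```python
import numpy as np

def ipw_value(contexts, actions, rewards, logging_probs, target_policy):
    """IPW estimate of a target policy's value from logged bandit data.

    contexts:      (n, d) array of observed contexts
    actions:       (n,)   integer actions chosen by the logging policy
    rewards:       (n,)   observed rewards
    logging_probs: (n,)   propensities of the logged actions under the logging policy
    target_policy: callable mapping a context to a probability vector over actions
    """
    # Probability the target policy assigns to each logged action.
    target_probs = np.array([target_policy(x)[a] for x, a in zip(contexts, actions)])
    # Importance weights; this requires overlap, i.e. the logging policy must put
    # positive probability on any action the target policy can take.
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

def select_best_policy(candidates, contexts, actions, rewards, logging_probs):
    """Rank candidate policies by estimated value and return the highest-ranked one."""
    values = [ipw_value(contexts, actions, rewards, logging_probs, pi) for pi in candidates]
    return candidates[int(np.argmax(values))], values
```

Plain IPW is unbiased under overlap but can have high variance; direct-method and doubly robust estimators behave differently depending on the logging and evaluation policies, and choosing among such estimators for best policy selection is the kind of practical question the paper addresses.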
Related papers
- Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
- Efficient Multi-Policy Evaluation for Reinforcement Learning [25.83084281519926]
We design a tailored behavior policy to reduce the variance of estimators across all target policies.
We show our estimator has a substantially lower variance compared with previous best methods.
arXiv Detail & Related papers (2024-08-16T12:33:40Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE): given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
- Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation [60.71312668265873]
We develop a method to balance the need for personalization with confident predictions.
We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
- Active Offline Policy Selection [19.18251239758809]
This paper addresses the problem of policy selection in domains with abundant logged data, but with a very restricted interaction budget.
Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data.
We introduce a novel active offline policy selection problem formulation, which combines logged data and limited online interactions to identify the best policy.
arXiv Detail & Related papers (2021-06-18T17:33:13Z)
- Offline Policy Comparison under Limited Historical Agent-Environment Interactions [0.0]
We address the challenge of policy evaluation in real-world applications of reinforcement learning systems.
We propose performing policy comparison, i.e., ranking the policies of interest by their value based on the available historical data.
arXiv Detail & Related papers (2021-06-07T19:51:00Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
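To illustrate the kernelization idea in the last entry: with continuous actions and a deterministic target policy, the exact importance weight involves an indicator that the logged action equals the target action, which is zero almost surely, so it is commonly replaced by a kernel. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth; it is a generic kernel-smoothed importance weighting estimator, not necessarily one of the doubly robust estimators proposed in that paper.

```python
import numpy as np

def kernel_smoothed_value(contexts, actions, rewards, logging_density, target_policy,
                          bandwidth=0.5):
    """Kernel-smoothed value estimate for a deterministic continuous-action policy.

    logging_density: callable (action, context) -> density of the logging policy
    target_policy:   callable context -> deterministic continuous action
    """
    estimates = []
    for x, a, r in zip(contexts, actions, rewards):
        # Replace the indicator 1{a == target_policy(x)} with a Gaussian kernel.
        diff = (a - target_policy(x)) / bandwidth
        kernel = np.exp(-0.5 * diff ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
        estimates.append(kernel * r / logging_density(a, x))
    return float(np.mean(estimates))
```

A smaller bandwidth lowers the smoothing bias but inflates variance; doubly robust variants additionally use a fitted reward model to reduce variance.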