Case-based off-policy policy evaluation using prototype learning
- URL: http://arxiv.org/abs/2111.11113v1
- Date: Mon, 22 Nov 2021 11:03:45 GMT
- Title: Case-based off-policy policy evaluation using prototype learning
- Authors: Anton Matsson, Fredrik D. Johansson
- Abstract summary: We propose estimating the behavior policy for off-policy policy evaluation using prototype learning.
We show how the prototypes give a condensed summary of differences between the target and behavior policies.
We also describe estimated values in terms of the prototypes to better understand which parts of the target policies have the most impact on the estimates.
- Score: 8.550140109387467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Importance sampling (IS) is often used to perform off-policy policy
evaluation but is prone to several issues, especially when the behavior policy
is unknown and must be estimated from data. Significant differences between the
target and behavior policies can result in uncertain value estimates due to,
for example, high variance and non-evaluated actions. If the behavior policy is
estimated using black-box models, it can be hard to diagnose potential problems
and to determine for which inputs the policies differ in their suggested
actions and resulting values. To address this, we propose estimating the
behavior policy for IS using prototype learning. We apply this approach in the
evaluation of policies for sepsis treatment, demonstrating how the prototypes
give a condensed summary of differences between the target and behavior
policies while retaining an accuracy comparable to baseline estimators. We also
describe estimated values in terms of the prototypes to better understand which
parts of the target policies have the most impact on the estimates. Using a
simulator, we study the bias resulting from restricting models to use
prototypes.
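As a rough illustration of the setting described above (not the paper's exact method), the sketch below shows how a self-normalized importance sampling estimate of a target policy's value depends on an estimated behavior policy, and how such a behavior model might be parameterized with prototypes. All function and variable names are hypothetical, and the prototype model is only a plausible stand-in for the paper's prototype-learning estimator.

```python
import numpy as np

def prototype_behavior_model(prototypes, prototype_action_probs):
    """Hypothetical prototype-based behavior policy: a state is soft-assigned to
    learned prototypes and inherits their action distributions.

    prototypes: array of shape (K, d) with prototype states.
    prototype_action_probs: array of shape (K, n_actions) with each prototype's
        action distribution.
    """
    def prob(state, action):
        d = np.linalg.norm(prototypes - state, axis=1) ** 2
        assignment = np.exp(-d) / np.sum(np.exp(-d))  # similarity-based soft assignment
        return float(assignment @ prototype_action_probs[:, action])
    return prob

def wis_value(trajectories, target_policy, behavior_model, gamma=0.99):
    """Weighted (self-normalized) importance sampling estimate of a target policy's value.

    trajectories: list of episodes, each a list of (state, action, reward) tuples
        logged under the unknown behavior policy.
    target_policy(state, action) -> probability of `action` under the target policy.
    behavior_model(state, action) -> estimated behavior-policy probability,
        e.g. from a prototype-based model in the spirit of the paper.
    """
    weights, returns = [], []
    for episode in trajectories:
        ratio, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in episode:
            # The importance ratio is sensitive to small estimated behavior
            # probabilities -- the hard-to-diagnose failure mode discussed above.
            ratio *= target_policy(state, action) / max(behavior_model(state, action), 1e-8)
            ret += discount * reward
            discount *= gamma
        weights.append(ratio)
        returns.append(ret)
    weights, returns = np.asarray(weights), np.asarray(returns)
    # Self-normalization trades a small bias for lower variance than ordinary IS.
    return float(np.sum(weights * returns) / np.sum(weights))
```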
Related papers
- Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies [24.706986328622193]
We consider off-policy evaluation of deterministic target policies for reinforcement learning, relaxing the deterministic policy with a kernel so that logged actions can be reused in-sample.
We learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function.
We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric.
arXiv Detail & Related papers (2024-05-29T06:17:33Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation [60.71312668265873]
We develop a method to balance the need for personalization with confident predictions.
We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
- Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits [5.144809478361604]
We improve the doubly robust (DR) estimator (sketched after this list) by adaptively weighting observations to control its variance.
We provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
arXiv Detail & Related papers (2021-06-03T17:54:44Z)
- Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies [3.855085732184416]
Off-policy evaluation, a key component of reinforcement learning, evaluates a target policy with offline data collected from behavior policies.
This paper discusses how to correctly mix estimators produced by different behavior policies.
Experiments on simulated recommender systems show that our methods are effective in reducing the mean squared error of estimation.
arXiv Detail & Related papers (2020-11-29T12:57:54Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
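Several of the entries above (the adaptive-weighting, quantile, and deterministic-policy papers) build on doubly robust (DR) estimation, which combines a learned reward model with importance weighting so that the estimate remains consistent if either component is accurate. The following is a minimal sketch for the contextual-bandit case, with hypothetical names; it is not any of the above papers' exact estimator.

```python
import numpy as np

def dr_value(actions, rewards, target_probs, behavior_probs, q_hat):
    """Doubly robust off-policy value estimate for a contextual bandit.

    actions, rewards: logged integer actions and observed rewards, shape (n,).
    target_probs[i, a]: probability of action a in context i under the target policy.
    behavior_probs[i]:  estimated behavior-policy probability of the logged action.
    q_hat[i, a]:        model-based estimate of the reward of action a in context i.
    """
    n = len(rewards)
    idx = np.arange(n)
    # Direct-method term: expected modeled reward under the target policy.
    dm = np.sum(target_probs * q_hat, axis=1)
    # Importance-weighted correction on the logged actions only.
    w = target_probs[idx, actions] / np.maximum(behavior_probs, 1e-8)
    correction = w * (rewards - q_hat[idx, actions])
    return float(np.mean(dm + correction))
```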