Quantile Off-Policy Evaluation via Deep Conditional Generative Learning
- URL: http://arxiv.org/abs/2212.14466v1
- Date: Thu, 29 Dec 2022 22:01:43 GMT
- Title: Quantile Off-Policy Evaluation via Deep Conditional Generative Learning
- Authors: Yang Xu, Chengchun Shi, Shikai Luo, Lan Wang, and Rui Song
- Abstract summary: Off-Policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
- Score: 21.448553360543478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-Policy evaluation (OPE) is concerned with evaluating a new target policy
using offline data generated by a potentially different behavior policy. It is
critical in a number of sequential decision making problems ranging from
healthcare to technology industries. Most of the work in existing literature is
focused on evaluating the mean outcome of a given policy, and ignores the
variability of the outcome. However, in a variety of applications, criteria
other than the mean may be more sensible. For example, when the reward
distribution is skewed and asymmetric, quantile-based metrics are often
preferred for their robustness. In this paper, we propose a doubly-robust
inference procedure for quantile OPE in sequential decision making and study
its asymptotic properties. In particular, we propose utilizing state-of-the-art
deep conditional generative learning methods to handle parameter-dependent
nuisance function estimation. We demonstrate the advantages of this proposed
estimator through both simulations and a real-world dataset from a short-video
platform. In particular, we find that our proposed estimator outperforms
classical OPE estimators for the mean in settings with heavy-tailed reward
distributions.
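The abstract describes the estimator only at a high level. As a rough, hedged illustration, the sketch below implements a doubly-robust quantile estimate in a simplified one-step (contextual-bandit) setting: an estimated conditional reward CDF plays the role of the nuisance function, and an importance-weighted correction term protects against misspecification of that model. The paper's actual procedure is sequential and fits the conditional reward distribution with a deep conditional generative model; here a linear Gaussian regression stands in for it, and all names (`dr_cdf`, `dr_quantile`, the simulated data) are illustrative assumptions rather than the authors' code.
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated logged data from a uniform behavior policy.
n, n_actions = 5000, 2
X = rng.normal(size=n)                           # context
A = rng.integers(0, n_actions, size=n)           # logged actions
R = X * (A == 1) + rng.standard_t(df=3, size=n)  # heavy-tailed reward
behavior_prob = np.full(n, 1.0 / n_actions)      # P_b(A_i | X_i)

def target_prob(x, a):
    """Target policy: action 1 when the context is positive, else action 0."""
    return ((x > 0) == (a == 1)).astype(float)

# Nuisance model for the conditional reward CDF F(r | x, a).
# Stand-in for the paper's deep conditional generative model: a linear
# Gaussian fit per action, used purely for illustration.
def fit_cond_cdf(X, A, R):
    params = {}
    for a in range(n_actions):
        m = A == a
        coef = np.polyfit(X[m], R[m], deg=1)
        sigma = (R[m] - np.polyval(coef, X[m])).std()
        params[a] = (coef, sigma)
    return lambda r, x, a: norm.cdf(r, loc=np.polyval(params[a][0], x),
                                    scale=params[a][1])

cond_cdf = fit_cond_cdf(X, A, R)

def dr_cdf(r):
    """Doubly-robust estimate of P(reward <= r) under the target policy."""
    w = target_prob(X, A) / behavior_prob                     # importance ratios
    direct = sum(target_prob(X, a) * cond_cdf(r, X, a) for a in range(n_actions))
    model_at_taken = np.empty(n)
    for a in range(n_actions):
        m = A == a
        model_at_taken[m] = cond_cdf(r, X[m], a)
    correction = w * ((R <= r).astype(float) - model_at_taken)
    return float(np.mean(direct + correction))

def dr_quantile(tau, grid=None):
    """Invert the estimated CDF on a grid to get the tau-quantile."""
    grid = np.linspace(-10, 10, 2001) if grid is None else grid
    cdf_vals = np.maximum.accumulate([dr_cdf(r) for r in grid])  # enforce monotonicity
    idx = min(int(np.searchsorted(cdf_vals, tau)), len(grid) - 1)
    return float(grid[idx])

print("Estimated median reward under the target policy:", dr_quantile(0.5))
```
In this toy setup the quantile estimate remains consistent if either the conditional reward model or the behavior-policy probabilities are well specified, which is the doubly-robust property the abstract refers to.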
Related papers
- Automated Off-Policy Estimator Selection via Supervised Learning [7.476028372444458]
The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies using data collected by another (logging) policy.
To solve the OPE problem, we rely on estimators that aim to estimate, as accurately as possible, the performance the counterfactual policies would have achieved had they been deployed in place of the logging policy.
We propose an automated data-driven OPE estimator selection method based on supervised learning.
arXiv Detail & Related papers (2024-06-26T02:34:48Z)
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators on a given dataset via a statistical procedure, rather than relying on an explicit selection among them.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Deeply-Debiased Off-Policy Interval Estimation [11.683223078990325]
Off-policy evaluation learns a target policy's value with a historical dataset generated by a different behavior policy.
Many applications would benefit significantly from having a confidence interval (CI) that quantifies the uncertainty of the point estimate.
We propose a novel procedure to construct an efficient, robust, and flexible CI on a target policy's value.
arXiv Detail & Related papers (2021-05-10T20:00:08Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings [0.0]
We construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity.
We show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.
We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
arXiv Detail & Related papers (2020-01-13T19:42:40Z)
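The "Universal Off-Policy Evaluation" entry above mentions estimating quantiles, CVaR, and the entire cumulative distribution of returns. As a hedged illustration of that general idea (not code from any of the listed papers), the sketch below reads a quantile and a lower-tail CVaR off a self-normalized, importance-weighted empirical distribution of per-episode returns; the return and weight arrays are synthetic stand-ins for logged trajectories and their trajectory-level importance ratios.
```python
import numpy as np

def weighted_quantile(returns, weights, tau):
    """tau-quantile of the self-normalised, weighted empirical return distribution."""
    order = np.argsort(returns)
    g = np.asarray(returns, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)                     # weighted empirical CDF
    idx = min(int(np.searchsorted(cdf, tau)), len(g) - 1)
    return float(g[idx])

def weighted_cvar(returns, weights, tau):
    """Average return in the lower tail, at or below the tau-quantile."""
    q = weighted_quantile(returns, weights, tau)
    g = np.asarray(returns, dtype=float)
    w = np.asarray(weights, dtype=float)
    tail = g <= q
    return float(np.sum(w[tail] * g[tail]) / np.sum(w[tail]))

# Synthetic example: heavy-tailed returns, lognormal stand-in importance weights.
rng = np.random.default_rng(1)
G = rng.standard_t(df=3, size=10_000)
W = rng.lognormal(sigma=0.5, size=10_000)
print("25% quantile:", weighted_quantile(G, W, 0.25))
print("CVaR at 25%:", weighted_cvar(G, W, 0.25))
```
The same weighted CDF also yields the mean, median, and inter-quantile range, which is why distributional estimators of this kind can report several risk metrics from a single pass over the logged episodes.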
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.