Offline Policy Selection under Uncertainty
- URL: http://arxiv.org/abs/2012.06919v1
- Date: Sat, 12 Dec 2020 23:09:21 GMT
- Title: Offline Policy Selection under Uncertainty
- Authors: Mengjiao Yang, Bo Dai, Ofir Nachum, George Tucker, Dale Schuurmans
- Abstract summary: We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to an arbitrary downstream policy selection metric.
- Score: 113.57441913299868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The presence of uncertainty in policy evaluation significantly complicates
the process of policy ranking and selection in real-world settings. We formally
consider offline policy selection as learning preferences over a set of policy
prospects given a fixed experience dataset. While one can select or rank
policies based on point estimates of their policy values or high-confidence
intervals, access to the full distribution over one's belief of the policy
value enables more flexible selection algorithms under a wider range of
downstream evaluation metrics. We propose BayesDICE for estimating this belief
distribution in terms of posteriors of distribution correction ratios derived
from stochastic constraints (as opposed to explicit likelihood, which is not
available). Empirically, BayesDICE is highly competitive with existing
state-of-the-art approaches in confidence interval estimation. More
importantly, we show how the belief distribution estimated by BayesDICE may be
used to rank policies with respect to an arbitrary downstream policy selection
metric, and we empirically demonstrate that this selection procedure
significantly outperforms existing approaches, such as ranking policies
according to mean or high-confidence lower bound value estimates.
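
As a concrete illustration of why a full belief distribution over policy values is more flexible than a point estimate or a confidence bound, the sketch below ranks candidate policies under a few downstream selection metrics given posterior samples of each policy's value. This is a minimal sketch, not BayesDICE itself: how the posterior samples are obtained (e.g., from posteriors over distribution correction ratios, as in the paper) is left outside the code, and the function name `rank_policies` and the "probability of being best" metric are illustrative choices.

```python
import numpy as np

def rank_policies(value_samples, metric="prob_best", alpha=0.1):
    """Rank policies from posterior samples of their values.

    value_samples: array of shape (num_policies, num_samples), where
    value_samples[i, j] is the j-th posterior sample of policy i's value.
    How these samples are produced (e.g., a BayesDICE-style posterior over
    correction ratios) is outside the scope of this sketch.
    """
    value_samples = np.asarray(value_samples)
    if metric == "mean":
        # Point-estimate baseline: rank by posterior mean.
        scores = value_samples.mean(axis=1)
    elif metric == "lcb":
        # High-confidence baseline: rank by the alpha-quantile (lower bound).
        scores = np.quantile(value_samples, alpha, axis=1)
    elif metric == "prob_best":
        # Downstream-metric example: posterior probability of being the
        # highest-value policy, estimated by counting which policy wins
        # in each joint posterior sample.
        winners = value_samples.argmax(axis=0)
        scores = np.bincount(winners, minlength=value_samples.shape[0]) / value_samples.shape[1]
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Higher score = preferred; return policy indices from best to worst.
    return np.argsort(-scores)

# Toy usage with made-up posterior samples for three candidate policies.
rng = np.random.default_rng(0)
samples = np.stack([
    rng.normal(1.0, 0.05, size=1000),   # well-estimated policy
    rng.normal(1.1, 0.50, size=1000),   # higher mean, far more uncertain
    rng.normal(0.9, 0.10, size=1000),
])
for m in ("mean", "lcb", "prob_best"):
    print(m, rank_policies(samples, metric=m))
```

Ranking by the alpha-quantile recovers the high-confidence lower bound baseline mentioned in the abstract, while the posterior samples also support metrics, such as probability of being best or expected top-k regret, that a single point estimate cannot express.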
Related papers
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims to identify and evaluate efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE): given an offline dataset of environment interactions, we seek a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
- Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation [60.71312668265873]
We develop a method to balance the need for personalization with confident predictions.
We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
- Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment [0.0]
We develop a robust optimization approach that partially identifies the expected utility of a policy and then finds an optimal policy.
We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations.
We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument.
arXiv Detail & Related papers (2021-09-22T00:52:03Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting [15.985182419152197]
We propose a new method to compute a lower bound on the value of an arbitrary target policy (a minimal self-normalized estimator sketch appears after this list).
The new approach is evaluated on a number of synthetic and real datasets and is found to be superior to its main competitors.
arXiv Detail & Related papers (2020-06-18T12:15:37Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of the policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
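
As a small companion to the "Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting" entry above, here is a minimal sketch of a self-normalized importance sampling (SNIS) value estimate with a bootstrap lower bound. The contextual-bandit data format (per-decision behavior and target propensities with observed rewards) and the bootstrap bound are assumptions made for illustration; the cited paper derives its lower bound differently.

```python
import numpy as np

def snis_value(rewards, target_prob, behavior_prob):
    """Self-normalized importance sampling (SNIS) estimate of a target
    policy's value from logged bandit data.

    rewards[i]       : reward observed for logged decision i
    target_prob[i]   : probability the target policy assigns to that action
    behavior_prob[i] : probability the behavior (logging) policy assigned to it
    """
    w = np.asarray(target_prob) / np.asarray(behavior_prob)  # importance weights
    return np.sum(w * np.asarray(rewards)) / np.sum(w)       # self-normalized average

def snis_lower_bound(rewards, target_prob, behavior_prob,
                     alpha=0.05, num_boot=2000, seed=0):
    """Bootstrap lower confidence bound on the SNIS estimate.

    Illustrative only; the referenced paper uses a different,
    non-bootstrap high-probability lower bound.
    """
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards)
    target_prob = np.asarray(target_prob)
    behavior_prob = np.asarray(behavior_prob)
    n = len(rewards)
    estimates = np.empty(num_boot)
    for b in range(num_boot):
        idx = rng.integers(0, n, size=n)  # resample logged decisions with replacement
        estimates[b] = snis_value(rewards[idx], target_prob[idx], behavior_prob[idx])
    return np.quantile(estimates, alpha)

# Toy usage with synthetic logged data.
rng = np.random.default_rng(1)
n = 500
behavior_prob = rng.uniform(0.2, 0.8, size=n)
target_prob = rng.uniform(0.2, 0.8, size=n)
rewards = rng.binomial(1, 0.6, size=n).astype(float)
print("SNIS estimate :", snis_value(rewards, target_prob, behavior_prob))
print("5% lower bound:", snis_lower_bound(rewards, target_prob, behavior_prob))
```

Self-normalization divides by the sum of importance weights rather than the sample size, which typically reduces variance at the cost of a small bias.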
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.