Offline Policy Comparison under Limited Historical Agent-Environment Interactions
- URL: http://arxiv.org/abs/2106.03934v1
- Date: Mon, 7 Jun 2021 19:51:00 GMT
- Title: Offline Policy Comparison under Limited Historical Agent-Environment Interactions
- Authors: Anton Dereventsov and Joseph D. Daws Jr. and Clayton Webster
- Abstract summary: We address the challenge of policy evaluation in real-world applications of reinforcement learning systems.
We propose that one should perform policy comparison, i.e., rank the policies of interest by their value based on the available historical data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the challenge of policy evaluation in real-world applications of
reinforcement learning systems where the available historical data is limited
due to ethical, practical, or security considerations. This constrained
distribution of data samples often leads to biased policy evaluation estimates.
To remedy this, we propose that instead of policy evaluation, one should
perform policy comparison, i.e., rank the policies of interest by their value
based on the available historical data. In addition, we present the Limited
Data Estimator (LDE) as a simple method for evaluating and comparing policies
from a small number of interactions with the environment. Our theoretical
analysis shows that the LDE is statistically reliable on policy comparison
tasks under mild assumptions on the distribution of the historical data.
Additionally, our numerical experiments compare the LDE to
other policy evaluation methods on the task of policy ranking and demonstrate
its advantage in various settings.
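The abstract above argues for ranking policies by estimated value rather than estimating each value in isolation. The sketch below illustrates that comparison workflow on a small logged dataset; since the LDE's exact formula is not reproduced in this summary, a standard inverse-propensity-scoring (IPS) estimate stands in for it, and the function names and synthetic data are purely illustrative.

```python
import numpy as np

def ips_value(policy_probs, behavior_probs, rewards):
    """Inverse-propensity-scoring value estimate from logged data.

    policy_probs:   pi(a_i | s_i) for the logged actions under the policy being scored
    behavior_probs: pi_b(a_i | s_i) under the behavior (logging) policy
    rewards:        observed rewards r_i
    """
    weights = policy_probs / behavior_probs
    return float(np.mean(weights * rewards))

def compare_policies(candidates, behavior_probs, rewards):
    """Rank candidate policies by estimated value (best first).

    candidates: dict mapping policy name -> array of pi(a_i | s_i) on the log.
    Returns a list of (name, estimate) pairs sorted by estimate, descending.
    """
    scores = {name: ips_value(p, behavior_probs, rewards) for name, p in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical log of 50 interactions collected by a behavior policy.
rng = np.random.default_rng(0)
behavior_probs = rng.uniform(0.2, 0.8, size=50)
rewards = rng.binomial(1, 0.5, size=50).astype(float)
candidates = {
    "policy_A": rng.uniform(0.1, 0.9, size=50),
    "policy_B": rng.uniform(0.1, 0.9, size=50),
}
print(compare_policies(candidates, behavior_probs, rewards))
```

Only the relative order of the scores is used, which is the comparison task the paper argues is more robust under limited data than absolute evaluation.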
Related papers
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Offline Policy Comparison with Confidence: Benchmarks and Baselines [28.775565917880915]
We create benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning.
We also present an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines.
arXiv Detail & Related papers (2022-05-22T04:28:25Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full distribution of one's belief over the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric.
arXiv Detail & Related papers (2020-12-12T23:09:21Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
- A Practical Guide of Off-Policy Evaluation for Bandit Problems [13.607327477092877]
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies.
We propose a meta-algorithm based on existing OPE estimators.
We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
arXiv Detail & Related papers (2020-10-23T15:11:19Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches (a kernel-smoothing sketch appears after this list).
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the off-policy evaluation literature, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Off-Policy Evaluation and Learning for External Validity under a Covariate Shift [32.37842308026544]
We consider evaluating and training a new policy for the evaluation data by using the historical data obtained from a different policy.
The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data.
arXiv Detail & Related papers (2020-02-26T17:18:43Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
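To complement the "Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies" entry above: with continuous actions and a deterministic target policy, logged actions almost never coincide with the target action, so plain importance weights are undefined. The sketch below smooths the target policy with a Gaussian kernel of bandwidth h, a common workaround; it is an illustrative kernel-smoothed importance-sampling estimate under assumed single-step data and a known behavior density, not necessarily the doubly robust estimators proposed in that paper.

```python
import numpy as np

def kernel_is_value(states, actions, rewards, behavior_density, target_action, bandwidth=0.2):
    """Kernel-smoothed importance-sampling value estimate for a deterministic
    continuous-action target policy pi(s) -> a.

    states, actions, rewards: logged single-step data
    behavior_density(s, a):   density of the logging policy at (s, a); assumed known
    target_action(s):         action the deterministic target policy would take
    bandwidth:                kernel bandwidth h controlling the bias/variance trade-off
    """
    n = len(rewards)
    estimate = 0.0
    for s, a, r in zip(states, actions, rewards):
        diff = (a - target_action(s)) / bandwidth
        # Gaussian kernel K_h(a - pi(s)) = phi(diff) / h
        kernel = np.exp(-0.5 * diff ** 2) / (np.sqrt(2 * np.pi) * bandwidth)
        estimate += kernel / behavior_density(s, a) * r
    return estimate / n

# Hypothetical example: 1-d states and actions, behavior policy a ~ N(0, 1).
rng = np.random.default_rng(1)
states = rng.normal(size=200)
actions = rng.normal(size=200)
rewards = -np.abs(actions - 0.5 * states)  # reward peaks when a = 0.5 * s
behavior_density = lambda s, a: np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)
target_policy = lambda s: 0.5 * s
print(kernel_is_value(states, actions, rewards, behavior_density, target_policy))
```

Smaller bandwidths reduce smoothing bias but increase variance, which is the usual trade-off when kernelizing importance weights for deterministic policies.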