Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies
- URL: http://arxiv.org/abs/2011.14359v1
- Date: Sun, 29 Nov 2020 12:57:54 GMT
- Title: Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies
- Authors: Jinlin Lai, Lixin Zou, Jiaxing Song
- Abstract summary: Off-policy evaluation is a key component of reinforcement learning that evaluates a target policy with offline data collected from behavior policies.
This paper discusses how to correctly mix estimators produced by different behavior policies.
Experiments on simulated recommender systems show that our methods are effective in reducing the Mean-Square Error of estimation.
- Score: 3.855085732184416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation is a key component of reinforcement learning that
evaluates a target policy with offline data collected from behavior policies.
It is a crucial step towards safe reinforcement learning and has been used in
advertisement, recommender systems and many other applications. In these
applications, sometimes the offline data is collected from multiple behavior
policies. Previous works treat data from different behavior policies equally.
However, some behavior policies yield better estimators than others. This paper
begins by discussing how to correctly mix estimators produced by different
behavior policies. We propose three ways to
reduce the variance of the mixture estimator when all sub-estimators are
unbiased or asymptotically unbiased. Furthermore, experiments on simulated
recommender systems show that our methods are effective in reducing the
Mean-Square Error of estimation.
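Since the abstract describes variance-minimizing mixtures of unbiased sub-estimators, a small worked baseline may help: when the sub-estimators are (approximately) independent and unbiased, weighting each by the inverse of its variance minimizes the variance of the mixture. The Python sketch below illustrates this with a toy two-logger importance-weighting setup; the function names, the synthetic data, and the use of empirical plug-in variances are assumptions of this sketch, not the paper's three proposed methods.

```python
import numpy as np

def inverse_variance_mixture(sub_estimates):
    """Combine unbiased sub-estimators with weights proportional to 1/variance.

    `sub_estimates` is a list of 1-D arrays, one per behavior policy, holding
    per-sample importance-weighted rewards. True variances are unknown, so
    empirical variances of each sample mean are used as plug-ins (an
    assumption of this sketch, not the paper's exact procedure).
    """
    means = np.array([e.mean() for e in sub_estimates])
    # Variance of each sub-estimator's sample mean: Var(samples) / n.
    variances = np.array([e.var(ddof=1) / len(e) for e in sub_estimates])
    weights = (1.0 / variances) / np.sum(1.0 / variances)
    return float(weights @ means), weights

# Toy example: two logging policies of different quality for the same target.
rng = np.random.default_rng(0)
r1 = rng.binomial(1, 0.3, 2000)    # rewards logged under policy 1
r2 = rng.binomial(1, 0.3, 2000)    # rewards logged under policy 2
w1 = rng.uniform(0.8, 1.2, 2000)   # importance ratios near 1 (well-matched logger)
w2 = rng.uniform(0.1, 5.0, 2000)   # widely spread ratios (high-variance logger)
value, weights = inverse_variance_mixture([r1 * w1, r2 * w2])
print(f"mixture estimate: {value:.3f}, weights: {weights.round(3)}")
```

With equal weights the mixture would inherit the high variance of the second logger; inverse-variance weighting pushes most of the weight onto the better-behaved one while keeping the combination unbiased up to the plug-in variance estimates.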
Related papers
- $\Delta\text{-}\mathrm{OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-}\mathrm{OPE}$ methods based on the widely used Inverse Propensity Scoring (IPS) estimator (a generic IPS sketch follows this list).
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
arXiv Detail & Related papers (2024-05-16T12:04:55Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Multi-Objective Recommendation via Multivariate Policy Learning [10.494676556696213]
Real-world recommender systems often need to balance multiple objectives when deciding which recommendations to present to users.
These include behavioural signals (e.g. clicks, shares, dwell time), as well as broader objectives (e.g. diversity, fairness).
arXiv Detail & Related papers (2024-05-03T14:44:04Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks using the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design [18.326126953667842]
We propose novel methods that improve the data efficiency of online Monte Carlo estimators.
We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator.
We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data.
arXiv Detail & Related papers (2023-01-31T16:12:31Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Case-based off-policy policy evaluation using prototype learning [8.550140109387467]
We propose estimating the behavior policy for off-policy policy evaluation using prototype learning.
We show how the prototypes give a condensed summary of differences between the target and behavior policies.
We also describe estimated values in terms of the prototypes to better understand which parts of the target policies have the most impact on the estimates.
arXiv Detail & Related papers (2021-11-22T11:03:45Z)
- Sayer: Using Implicit Feedback to Optimize System Policies [63.992191765269396]
We develop a methodology that leverages implicit feedback to evaluate and train new system policies.
Sayer builds on two ideas from reinforcement learning to leverage data collected by an existing policy.
We show that Sayer can evaluate arbitrary policies accurately, and train new policies that outperform the production policies.
arXiv Detail & Related papers (2021-10-28T04:16:56Z)
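Several of the entries above (notably the $\Delta\text{-}\mathrm{OPE}$ and UIPS papers) build on the Inverse Propensity Scoring estimator. A minimal, generic IPS sketch follows for reference; the function names are assumptions of this sketch, and the self-normalized variant is included only as a common, well-known companion, not as the method of any listed paper.

```python
import numpy as np

def ips_value(rewards, target_probs, logging_probs):
    """Plain IPS estimate of a target policy's value from logged bandit data.

    rewards[i]       reward observed for the logged action
    target_probs[i]  probability the target policy assigns to that action
    logging_probs[i] probability the logging (behavior) policy assigned to it
    """
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

def snips_value(rewards, target_probs, logging_probs):
    """Self-normalized IPS: divides by the sum of importance weights, trading
    a small bias for a typically large reduction in variance."""
    weights = target_probs / logging_probs
    return float(np.sum(weights * rewards) / np.sum(weights))
```

Both estimators require only the logged propensities; the papers listed above differ in how they correct, reweight, or pair such estimates.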
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.