Uncertainty-Aware Instance Reweighting for Off-Policy Learning
- URL: http://arxiv.org/abs/2303.06389v2
- Date: Wed, 27 Sep 2023 08:22:47 GMT
- Title: Uncertainty-Aware Instance Reweighting for Off-Policy Learning
- Authors: Xiaoying Zhang, Junpu Chen, Hongning Wang, Hong Xie, Yang Liu, John C.S. Lui, Hang Li
- Abstract summary: We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
- Score: 63.31923483172859
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Off-policy learning, the procedure of optimizing a policy with
access only to logged feedback data, has shown importance in various real-world
applications, such as search engines and recommender systems. While the
ground-truth logging policy, which generates the logged data, is usually
unknown, previous work simply takes its estimated value in off-policy learning,
ignoring the high bias and high variance that result from such an estimator,
especially on samples with small and inaccurately estimated logging
probabilities. In this work, we explicitly model the uncertainty in the
estimated logging policy and propose an Uncertainty-aware Inverse Propensity
Score estimator (UIPS) for improved off-policy learning, with a theoretical
convergence guarantee. Experimental results on synthetic and three real-world
recommendation datasets demonstrate the advantageous sample efficiency of the
proposed UIPS estimator against an extensive list of state-of-the-art
baselines.
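To make the abstract's concern concrete, here is a minimal sketch of the plain inverse propensity score (IPS) estimator that UIPS builds on. It is not the UIPS estimator itself, whose formulation is not given on this page. The function names, the toy log, and the `clip` parameter (a common weight-clipping heuristic) are illustrative assumptions, not taken from the paper.

```python
def ips_estimate(logs, target_prob, clip=None):
    """Plain IPS estimate of a target policy's value from logged data.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the (estimated) probability the logging policy
          assigned to the logged action.
    target_prob: function (context, action) -> probability under the target policy.
    clip: optional cap on the importance weight (hypothetical knob for illustration).
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_prob(context, action) / logging_prob
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(logs)


def make_logs(p1_estimate):
    """Toy log of 100 interactions; action 1 is logged 5 times (true rate 0.05).

    Reward is 1 for action 1 and 0 for action 0. p1_estimate is the
    propensity the learner *believes* the logging policy gave to action 1.
    """
    logs = []
    for i in range(100):
        action = 1 if i < 5 else 0
        reward = float(action)
        prob = p1_estimate if action == 1 else 1.0 - p1_estimate
        logs.append((None, action, reward, prob))
    return logs


def always_one(context, action):
    """Target policy that always plays action 1 (true value 1.0 in this toy)."""
    return 1.0 if action == 1 else 0.0


exact = ips_estimate(make_logs(0.05), always_one)          # correct propensity
misestimated = ips_estimate(make_logs(0.025), always_one)  # propensity underestimated
clipped = ips_estimate(make_logs(0.025), always_one, clip=10.0)
print(exact, misestimated, clipped)
```

In this toy setting the estimate matches the true value of 1.0 when the propensity is correct, doubles when the small propensity is underestimated by half, and falls below the true value once weights are clipped. This is the sensitivity to small, inaccurately estimated logging probabilities that the abstract describes.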
Related papers
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI).
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
- Positivity-free Policy Learning with Observational Data [8.293758599118618]
This study introduces a novel positivity-free (stochastic) policy learning framework.
We propose incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments.
This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework's finite-sample performance.
arXiv Detail & Related papers (2023-10-10T19:47:27Z)
- Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments [31.492146288630515]
We introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias.
We empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
arXiv Detail & Related papers (2023-02-23T01:17:21Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
- Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits [25.153656462604268]
Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems.
The counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated.
This paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.
arXiv Detail & Related papers (2022-02-03T17:37:11Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged data.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
- Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.