Anytime-valid off-policy inference for contextual bandits
- URL: http://arxiv.org/abs/2210.10768v1
- Date: Wed, 19 Oct 2022 17:57:53 GMT
- Title: Anytime-valid off-policy inference for contextual bandits
- Authors: Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, and
Paul Mineiro
- Abstract summary: "Off-policy evaluation" (OPE) is the problem of estimating properties of a hypothetical policy that differs from the logging policy used to collect the data.
We present a comprehensive framework for OPE inference that relaxes many unnecessary assumptions made in past work.
We derive confidence sequences for various functionals of interest in OPE.
- Score: 35.92569763743846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual bandits are a modern staple tool for active sequential
experimentation in the tech industry. They involve online learning algorithms
that adaptively (over time) learn policies to map observed contexts $X_t$ to
actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This
adaptivity raises interesting but hard statistical inference questions,
especially counterfactual ones: for example, it is often of interest to
estimate the properties of a hypothetical policy that is different from the
logging policy that was used to collect the data -- a problem known as
"off-policy evaluation" (OPE). Using modern martingale techniques, we present a
comprehensive framework for OPE inference that relaxes many unnecessary
assumptions made in past work, significantly improving on them theoretically
and empirically. Our methods remain valid in very general settings, and can be
employed while the original experiment is still running (that is, not
necessarily post-hoc), when the logging policy may be itself changing (due to
learning), and even if the context distributions are drifting over time. More
concretely, we derive confidence sequences for various functionals of interest
in OPE. These include doubly robust ones for time-varying off-policy mean
reward values, but also confidence bands for the entire CDF of the off-policy
reward distribution. All of our methods (a) are valid at arbitrary stopping
times, (b) only make nonparametric assumptions, (c) do not require known bounds
on the maximal importance weights, and (d) adapt to the empirical variance of
the reward and weight distributions. In summary, our methods enable
anytime-valid off-policy inference using adaptively collected contextual bandit
data.
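To make the abstract's guarantee concrete, here is a minimal sketch (not the paper's algorithm): it forms inverse-propensity-weighted (IPW) pseudo-outcomes and wraps them in a Hoeffding-type normal-mixture confidence sequence, which holds simultaneously at every time step and hence at any stopping time. The sketch assumes rewards in [0, 1] and a known bound w_max on the importance weights; the function name, the synthetic data, and the mixture scale tau are illustrative, and the paper's doubly robust, empirical-Bernstein-style sequences are designed precisely to remove these boundedness assumptions and to adapt to the empirical variance.

```python
# Minimal sketch (not the paper's algorithm): an anytime-valid confidence
# sequence for the off-policy mean reward, built from IPW pseudo-outcomes and
# a sub-Gaussian (Hoeffding-type) normal-mixture boundary.  Assumes rewards in
# [0, 1] and a KNOWN bound w_max on the importance weights -- assumptions the
# paper itself removes.
import numpy as np

def ipw_confidence_sequence(rewards, weights, w_max, alpha=0.05, tau=1.0):
    """Running (lower, upper) bounds on the off-policy value, valid
    simultaneously at every time step t (hence at any stopping time).

    rewards[t] = R_t observed under the logging policy, assumed in [0, 1]
    weights[t] = pi_target(A_t | X_t) / pi_log(A_t | X_t), assumed <= w_max
    tau        = scale of the Gaussian mixture (a tuning parameter)
    """
    phi = np.asarray(weights) * np.asarray(rewards)   # IPW pseudo-outcomes in [0, w_max]
    t = np.arange(1, len(phi) + 1)
    running_mean = np.cumsum(phi) / t

    sigma = w_max / 2.0                               # sub-Gaussian scale via Hoeffding's lemma
    v = t * sigma ** 2                                # variance process V_t
    # Robbins-style normal-mixture boundary: time-uniform by Ville's inequality.
    radius = np.sqrt(2.0 * (v + 1.0 / tau ** 2)
                     * np.log(np.sqrt(tau ** 2 * v + 1.0) / alpha)) / t

    lower = np.clip(running_mean - radius, 0.0, w_max)  # crude clip to trivial range
    upper = np.clip(running_mean + radius, 0.0, w_max)
    return lower, upper

# Example with stand-in logged data (weights and rewards are synthetic).
rng = np.random.default_rng(0)
n, w_max = 5000, 4.0
weights = rng.uniform(0.0, w_max, size=n)
rewards = rng.binomial(1, 0.3, size=n)
lo, hi = ipw_confidence_sequence(rewards, weights, w_max)
print(f"CS at t={n}: [{lo[-1]:.3f}, {hi[-1]:.3f}]")
```

Because the boundary is time-uniform, the interval can be monitored continuously while the bandit experiment is still running without inflating the error probability.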
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates; a schematic LCB example appears after this list.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z)
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
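As promised in the pessimistic policy learning entry above, the following is a schematic illustration of ranking candidate policies by a lower confidence bound (LCB) on their estimated value rather than by the point estimate. It is not the PPL algorithm from that paper: it assumes i.i.d. value samples with a known range and uses a fixed-time Hoeffding bound; the helper names and synthetic numbers are purely illustrative.

```python
# Schematic lower-confidence-bound (pessimistic) policy selection.
# Assumes each candidate policy has i.i.d. value samples bounded in a known
# range; uses a fixed-time Hoeffding bound purely for illustration.
import numpy as np

def hoeffding_lcb(samples, value_range, alpha=0.05):
    """Lower confidence bound on the mean of `samples` at level alpha."""
    n = len(samples)
    return float(np.mean(samples)) - value_range * np.sqrt(np.log(1.0 / alpha) / (2.0 * n))

def pick_pessimistic(policy_value_samples, value_range, alpha=0.05):
    """Pick the policy with the largest LCB rather than the largest mean."""
    lcbs = [hoeffding_lcb(s, value_range, alpha) for s in policy_value_samples]
    return int(np.argmax(lcbs)), lcbs

rng = np.random.default_rng(1)
samples = [
    rng.uniform(0.4, 0.7, size=400),   # policy A: modest value, well estimated
    rng.uniform(0.0, 1.4, size=20),    # policy B: noisier, far fewer samples
]
best, lcbs = pick_pessimistic(samples, value_range=2.0)
print("chosen policy:", best, "LCBs:", [round(b, 3) for b in lcbs])
```

Even when the noisy policy's point estimate looks better, its wide confidence interval pulls its LCB down, so the well-estimated policy is selected.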
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.