Distributionally Robust Batch Contextual Bandits
- URL: http://arxiv.org/abs/2006.05630v7
- Date: Mon, 11 Sep 2023 20:54:49 GMT
- Title: Distributionally Robust Batch Contextual Bandits
- Authors: Nian Si, Fan Zhang, Zhengyuan Zhou, Jose Blanchet
- Abstract summary: Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
- Score: 20.667213458836734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy learning using historical observational data is an important problem
that has found widespread applications. Examples include selecting offers,
prices, advertisements to send to customers, as well as selecting which
medication to prescribe to a patient. However, existing literature rests on the
crucial assumption that the future environment where the learned policy will be
deployed is the same as the past environment that has generated the data -- an
assumption that is often false or too coarse an approximation. In this paper,
we lift this assumption and aim to learn a distributionally robust policy with
incomplete observational data. We first present a policy evaluation procedure
that allows us to assess how well the policy does under the worst-case
environment shift. We then establish a central limit theorem type guarantee for
this proposed policy evaluation scheme. Leveraging this evaluation scheme, we
further propose a novel learning algorithm that is able to learn a policy that
is robust to adversarial perturbations and unknown covariate shifts with a
performance guarantee based on the theory of uniform convergence. Finally, we
empirically test the effectiveness of our proposed algorithm in synthetic
datasets and demonstrate that it provides the robustness that is missing using
standard policy learning algorithms. We conclude the paper by providing a
comprehensive application of our methods in the context of a real-world voting
dataset.
Related papers
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level)
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z) - Counterfactual Learning with General Data-generating Policies [3.441021278275805]
We develop an OPE method for a class of full support and deficient support logging policies in contextual-bandit settings.
We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases.
arXiv Detail & Related papers (2022-12-04T21:07:46Z) - Conformal Off-Policy Prediction in Contextual Bandits [54.67508891852636]
Conformal off-policy prediction can output reliable predictive intervals for the outcome under a new target policy.
We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup.
arXiv Detail & Related papers (2022-06-09T10:39:33Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged data.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under
Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z) - A Practical Guide of Off-Policy Evaluation for Bandit Problems [13.607327477092877]
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies.
We propose a meta-algorithm based on existing OPE estimators.
We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
arXiv Detail & Related papers (2020-10-23T15:11:19Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z) - Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.