Generalizing Off-Policy Learning under Sample Selection Bias
- URL: http://arxiv.org/abs/2112.01387v1
- Date: Thu, 2 Dec 2021 16:18:16 GMT
- Title: Generalizing Off-Policy Learning under Sample Selection Bias
- Authors: Tobias Hatt, Daniel Tschernutter, Stefan Feuerriegel
- Abstract summary: We propose a novel framework for learning policies that generalize to the target population.
We prove that, if the uncertainty set is well-specified, our policies generalize to the target population because they cannot perform worse than they do on the training data.
- Score: 15.733136147164032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning personalized decision policies that generalize to the target
population is of great relevance. Since training data is often not
representative of the target population, standard policy learning methods may
yield policies that do not generalize to the target population. To address this
challenge, we propose a novel framework for learning policies that generalize
to the target population. For this, we characterize the difference between the
training data and the target population as a sample selection bias using a
selection variable. Over an uncertainty set around this selection variable, we
optimize the minimax value of a policy to achieve the best worst-case policy
value on the target population. In order to solve the minimax problem, we
derive an efficient algorithm based on a convex-concave procedure and prove
convergence for parametrized spaces of policies such as logistic policies. We
prove that, if the uncertainty set is well-specified, our policies generalize
to the target population because they cannot perform worse than they do on the training data.
Using simulated data and a clinical trial, we demonstrate that, compared to
standard policy learning methods, our framework improves the generalizability
of policies substantially.
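The abstract describes the approach only at a high level, so the following is a minimal sketch of the worst-case (minimax) idea: the policy value is estimated per sample by inverse-propensity weighting, an adversary reweights samples within an assumed box-shaped uncertainty set, and the logistic-policy parameters are updated by plain gradient ascent. The box bound gamma, the IPW value estimate, and the alternating gradient step are illustrative assumptions standing in for the paper's selection-variable uncertainty set and convex-concave procedure, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ipw_terms(theta, X, A, Y, propensity):
    # Per-sample inverse-propensity-weighted contribution to the value of a
    # logistic policy pi_theta over a binary action A in {0, 1}.
    p1 = sigmoid(X @ theta)                  # pi_theta(A=1 | x)
    pi_a = np.where(A == 1, p1, 1.0 - p1)    # probability of the observed action
    return pi_a / propensity * Y

def worst_case_weights(terms, gamma):
    # Adversarial selection weights inside an (assumed) box uncertainty set
    # [1/gamma, gamma]: down-weight favourable samples, up-weight unfavourable ones.
    return np.where(terms > 0, 1.0 / gamma, gamma)

def robust_policy_learning(X, A, Y, propensity, gamma=2.0, lr=0.05, iters=500):
    # Alternate the inner worst case over selection weights with a gradient
    # ascent step on the logistic policy parameters (a stand-in for the paper's
    # convex-concave procedure, not the authors' algorithm).
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        terms = ipw_terms(theta, X, A, Y, propensity)
        w = worst_case_weights(terms, gamma)                 # inner minimization
        p1 = sigmoid(X @ theta)
        dpi = np.where(A == 1, 1.0, -1.0) * p1 * (1.0 - p1)  # derivative of pi_theta(a|x) w.r.t. theta^T x
        grad = X.T @ (w * dpi * Y / propensity) / n          # gradient of the weighted value estimate
        theta += lr * grad                                   # outer maximization step
    return theta
```

For example, `theta = robust_policy_learning(X, A, Y, propensity)` returns logistic-policy parameters that maximize this illustrative worst-case value; the paper's guarantee is that, with a well-specified uncertainty set, such a worst-case value lower-bounds the policy's value on the target population.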
Related papers
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Local Policy Improvement for Recommender Systems [8.617221361305901]
We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
arXiv Detail & Related papers (2022-12-22T00:47:40Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline data-driven sequential decision-making problem in the framework of a Markov decision process (MDP).
We propose to evaluate each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.