Wasserstein Distributionally Robust Policy Evaluation and Learning for
Contextual Bandits
- URL: http://arxiv.org/abs/2309.08748v3
- Date: Wed, 17 Jan 2024 14:07:16 GMT
- Title: Wasserstein Distributionally Robust Policy Evaluation and Learning for
Contextual Bandits
- Authors: Yi Shen, Pan Xu, Michael M. Zavlanos
- Abstract summary: Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment.
To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed.
We propose a novel DRO approach that employs the Wasserstein distance instead.
- Score: 18.982448033389588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation and learning are concerned with assessing a given
policy and learning an optimal policy from offline data without direct
interaction with the environment. Often, the environment in which the data are
collected differs from the environment in which the learned policy is applied.
To account for the effect of different environments during learning and
execution, distributionally robust optimization (DRO) methods have been
developed that compute worst-case bounds on the policy values assuming that the
distribution of the new environment lies within an uncertainty set. Typically,
this uncertainty set is defined based on the KL divergence around the empirical
distribution computed from the logging dataset. However, the KL uncertainty set
fails to encompass distributions with varying support and lacks awareness of
the geometry of the distribution support. As a result, KL approaches fall short
in addressing practical environment mismatches and lead to over-fitting to
worst-case scenarios. To overcome these limitations, we propose a novel DRO
approach that employs the Wasserstein distance instead. While Wasserstein DRO
is generally computationally more expensive compared to KL DRO, we present a
regularized method and a practical (biased) stochastic gradient descent method
to optimize the policy efficiently. We also provide a theoretical analysis of
the finite sample complexity and iteration complexity for our proposed method.
We further validate our approach using a public dataset that was recorded in a
randomized stroke trial.
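As a minimal sketch of the contrast drawn above, and not the paper's regularized method or its biased stochastic gradient procedure, the snippet below computes an importance-weighted value for a target policy on hypothetical logged bandit data, then a KL-DRO lower bound via the standard exponential-tilting dual and a Wasserstein-1 style surrogate that penalizes the empirical value by the ball radius times an assumed Lipschitz constant. The helper functions, radius, and Lipschitz constant are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact algorithm): contrast a KL-DRO
# worst-case bound with a Wasserstein-1 style regularized surrogate for an
# importance-weighted policy value on toy logged contextual-bandit data.
import numpy as np
from scipy.optimize import minimize_scalar

def ips_values(rewards, target_probs, logging_probs):
    """Per-sample importance-weighted rewards under the target policy."""
    return rewards * target_probs / logging_probs

def kl_dro_lower_bound(values, rho):
    """Worst-case mean over a KL ball of radius rho around the empirical
    distribution, via the standard dual
        sup_{alpha > 0}  -alpha * log E[exp(-values / alpha)] - alpha * rho.
    """
    def neg_dual(alpha):
        z = -values / alpha
        log_mean_exp = np.log(np.mean(np.exp(z - z.max()))) + z.max()
        return alpha * log_mean_exp + alpha * rho  # negative dual objective
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

def w1_regularized_lower_bound(values, eps, lipschitz):
    """Wasserstein-1 style surrogate: penalize the empirical value by the ball
    radius times an (assumed) Lipschitz constant of the per-sample value."""
    return values.mean() - eps * lipschitz

# Hypothetical logged data: binary rewards, logging and target action probabilities.
rng = np.random.default_rng(0)
n = 500
rewards = rng.binomial(1, 0.6, size=n).astype(float)
logging_probs = np.full(n, 0.5)
target_probs = rng.uniform(0.3, 0.9, size=n)

vals = ips_values(rewards, target_probs, logging_probs)
print("empirical IPS value  :", vals.mean())
print("KL-DRO lower bound   :", kl_dro_lower_bound(vals, rho=0.1))
print("W1-regularized bound :", w1_regularized_lower_bound(vals, eps=0.1, lipschitz=1.0))
```

The KL bound only reweights the logged samples, whereas the Wasserstein surrogate also accounts for perturbations of the samples themselves, which is the support and geometry awareness the abstract attributes to the Wasserstein uncertainty set.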
Related papers
- Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters.
We learn and analyse interval Markov decision processes (IMDPs) for a set of unknown sample environments induced by parameters.
We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z) - $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor reparametric procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z) - Grounding Aleatoric Uncertainty in Unsupervised Environment Design [32.00797965770773]
In partially-observable settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment.
We propose a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to curriculum-induced covariate shift (CICS).
arXiv Detail & Related papers (2022-07-11T22:45:29Z) - Batch Reinforcement Learning with a Nonparametric Off-Policy Policy
Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form (a generic closed-form solve of this kind is sketched after this list).
arXiv Detail & Related papers (2020-10-27T13:40:06Z) - PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient
Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose the implicit distributional actor-critic (IDAC), built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning applications such as education and healthcare.
We develop an approach that estimates the bounds of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
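The nonparametric off-policy policy gradient entry above mentions a Bellman equation that can be solved in closed form. The sketch below is a generic illustration of that idea under assumed toy data and a Gaussian kernel (the helpers, bandwidth, and discount are illustrative choices, not that paper's estimator): replacing the expectation over next states with kernel weights onto the logged states turns the Bellman fixed-point equation into a finite linear system.

```python
# Generic illustration, not the cited paper's estimator: once expectations over
# next states are replaced by (assumed) Gaussian-kernel weights onto the logged
# states, the Bellman equation becomes a finite linear system solved in closed form.
import numpy as np

def rbf_weights(X, Y, bandwidth=0.5):
    """Gaussian kernel similarities between rows of X and rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def closed_form_values(states, next_states, rewards, gamma=0.95, bandwidth=0.5):
    """Approximate V on the logged states: V(s_i) = r_i + gamma * sum_j P[i, j] * V(s_j),
    where row i of P spreads next_states[i] over the logged states, then solve
    (I - gamma * P) v = r directly."""
    K = rbf_weights(next_states, states, bandwidth)
    P = K / K.sum(axis=1, keepdims=True)      # row-stochastic by construction
    A = np.eye(len(states)) - gamma * P       # invertible since gamma < 1
    return np.linalg.solve(A, rewards)

# Hypothetical logged transitions on a 1-D state space.
rng = np.random.default_rng(1)
states = rng.uniform(-1.0, 1.0, size=(50, 1))
next_states = np.clip(states + rng.normal(0.0, 0.1, size=states.shape), -1.0, 1.0)
rewards = (1.0 - np.abs(states)).ravel()      # reward peaks at the origin
print(closed_form_values(states, next_states, rewards)[:5])
```

Because each row of the kernel-weight matrix sums to one and is discounted by gamma < 1, the linear system is always invertible, which is what makes the closed-form solve well defined.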