A Minimax Learning Approach to Off-Policy Evaluation in Partially
Observable Markov Decision Processes
- URL: http://arxiv.org/abs/2111.06784v1
- Date: Fri, 12 Nov 2021 15:52:24 GMT
- Title: A Minimax Learning Approach to Off-Policy Evaluation in Partially
Observable Markov Decision Processes
- Authors: Chengchun Shi, Masatoshi Uehara and Nan Jiang
- Abstract summary: We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs).
Existing methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces.
We first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
- Score: 31.215206208622728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider off-policy evaluation (OPE) in Partially Observable Markov
Decision Processes (POMDPs), where the evaluation policy depends only on
observable variables and the behavior policy depends on unobservable latent
variables. Existing works either assume no unmeasured confounders, or focus on
settings where both the observation and the state spaces are tabular. As such,
these methods suffer from either a large bias in the presence of unmeasured
confounders, or a large variance in settings with continuous or large
observation/state spaces. In this work, we first propose novel identification
methods for OPE in POMDPs with latent confounders, by introducing bridge
functions that link the target policy's value and the observed data
distribution. In fully-observable MDPs, these bridge functions reduce to the
familiar value functions and marginal density ratios between the evaluation and
the behavior policies. We next propose minimax estimation methods for learning
these bridge functions. Our proposal permits general function approximation and
is thus applicable to settings with continuous or large observation/state
spaces. Finally, we construct three estimators based on these estimated bridge
functions, corresponding to a value function-based estimator, a marginalized
importance sampling estimator, and a doubly-robust estimator. Their
nonasymptotic and asymptotic properties are investigated in detail.
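As a rough illustration of the minimax learning step described above (a generic sketch; the symbols h, f, m, H, F, and lambda are placeholders, not the paper's own notation), learning a bridge function h that satisfies a conditional moment restriction E[m(Z; h) | X] = 0 is commonly cast as an adversarial objective over a critic class:

\[
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}} \; \max_{f \in \mathcal{F}} \;
\left\{ \frac{1}{n} \sum_{i=1}^{n} f(X_i)\, m(Z_i; h) \;-\; \lambda \,\|f\|_{\mathcal{F}}^{2} \right\},
\]

where H and F are user-chosen function classes (for example RKHSs or neural networks) and lambda >= 0 regularizes the critic. Allowing general classes in this step is what makes the approach compatible with continuous or large observation/state spaces; in the fully observable special case, the learned bridge functions play the role of value functions and marginal density ratios, as noted in the abstract.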
Related papers
- RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation [73.2390735383842]
We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions.
We show how these can be used to derive near-optimal guarantees for an optimistic exploration algorithm.
These results can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
arXiv Detail & Related papers (2024-06-03T14:51:27Z) - Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator also permits statistical inference using Wald confidence intervals.
arXiv Detail & Related papers (2024-03-29T18:11:49Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Off-Policy Evaluation for Episodic Partially Observable Markov Decision
Processes under Non-Parametric Models [2.3411358616430435]
We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states.
Motivated by the recently proposed causal inference framework, we develop a non-parametric identification result for estimating the policy value.
This is the first finite-sample error bound for OPE in POMDPs under non-parametric models.
arXiv Detail & Related papers (2022-09-21T01:44:45Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in
Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - A Spectral Approach to Off-Policy Evaluation for POMDPs [8.613667867961034]
We consider off-policy evaluation in Partially Observable Markov Decision Processes.
Prior work on this problem uses a causal identification strategy based on one-step observable proxies of the hidden state.
In this work, we relax this requirement by using spectral methods and extending one-step proxies both into the past and future.
arXiv Detail & Related papers (2021-09-22T03:36:51Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Causal Inference Under Unmeasured Confounding With Negative Controls: A
Minimax Learning Approach [84.29777236590674]
We study the estimation of causal parameters when not all confounders are observed and instead negative controls are available.
Recent work has shown how these can enable identification and efficient estimation via two so-called bridge functions.
arXiv Detail & Related papers (2021-03-25T17:59:19Z) - Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z) - Neural Methods for Point-wise Dependency Estimation [129.93860669802046]
We focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur.
We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval task.
arXiv Detail & Related papers (2020-06-09T23:26:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.