Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in
Partially Observed Markov Decision Processes
- URL: http://arxiv.org/abs/2110.15332v2
- Date: Wed, 22 Mar 2023 22:24:18 GMT
- Title: Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in
Partially Observed Markov Decision Processes
- Authors: Andrew Bennett, Nathan Kallus
- Abstract summary: In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
- Score: 65.91730154730905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In applications of offline reinforcement learning to observational data, such
as in healthcare or education, a general concern is that observed actions might
be affected by unobserved factors, inducing confounding and biasing estimates
derived under the assumption of a perfect Markov decision process (MDP) model.
Here we tackle this by considering off-policy evaluation in a partially
observed MDP (POMDP). Specifically, we consider estimating the value of a given
target policy in a POMDP given trajectories with only partial state
observations generated by a different and unknown policy that may depend on the
unobserved state. We tackle two questions: what conditions allow us to identify
the target policy value from the observed data and, given identification, how
to best estimate it. To answer these, we extend the framework of proximal
causal inference to our POMDP setting, providing a variety of settings where
identification is made possible by the existence of so-called bridge functions.
We then show how to construct semiparametrically efficient estimators in these
settings. We term the resulting framework proximal reinforcement learning
(PRL). We demonstrate the benefits of PRL in an extensive simulation study and
on the problem of sepsis management.
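To make the abstract's bridge-function idea concrete, here is a simplified, single-stage sketch in the standard notation of proximal causal inference (A: action, Y: outcome, U: unobserved confounder, Z and W: observed proxies of U). It is only an illustration of the mechanism the paper generalizes; the paper's POMDP construction chains such conditions over trajectories and relies on additional proxy-validity and completeness assumptions not shown here.

```latex
% Single-stage illustration of identification via an outcome bridge function.
% The conditional-moment restriction below defines the bridge function h;
% proxy-independence and completeness conditions (omitted) make the argument go through.
\begin{align*}
  \text{outcome bridge function } h:\quad
    & \mathbb{E}\big[\, Y - h(W, A) \;\big|\; Z, A \,\big] = 0, \\
  \text{identification (under proxy conditions):}\quad
    & \mathbb{E}\big[\, Y(a) \,\big] = \mathbb{E}\big[\, h(W, a) \,\big].
\end{align*}
```

In the sequential setting considered by the paper, analogous bridge functions are defined over time, and the target policy's value is expressed through them under the observed data distribution.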
Related papers
- RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation [73.2390735383842]
We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions.
We show how off-policy evaluation guarantees can be used to derive near-optimal guarantees for an optimistic exploration algorithm.
These results can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
arXiv Detail & Related papers (2024-06-03T14:51:27Z) - An Instrumental Variable Approach to Confounded Off-Policy Evaluation [11.785128674216903]
Off-policy evaluation (OPE) is a method for estimating the return of a target policy.
This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes.
arXiv Detail & Related papers (2022-12-29T22:06:51Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z) - Generalizing Off-Policy Evaluation From a Causal Perspective For
Sequential Decision-Making [32.06576007608403]
We argue that explicitly highlighting the association between OPE and causal inference has important implications for our understanding of the fundamental limits of OPE.
We demonstrate how this association motivates natural desiderata to consider a general set of causal estimands.
We discuss each of these aspects as actionable desiderata for future OPE research at scale and in line with practical utility.
arXiv Detail & Related papers (2022-01-20T16:13:16Z) - A Minimax Learning Approach to Off-Policy Evaluation in Partially
Observable Markov Decision Processes [31.215206208622728]
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs).
Existing methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces.
We first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
arXiv Detail & Related papers (2021-11-12T15:52:24Z) - Off-Policy Evaluation in Partially Observed Markov Decision Processes
under Sequential Ignorability [8.388782503421504]
We consider off-policy evaluation of dynamic treatment rules under sequential ignorability.
We show that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes.
arXiv Detail & Related papers (2021-10-24T03:35:23Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration (a rough sketch of this reweighting idea appears after this list).
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z) - Interpretable Off-Policy Evaluation in Reinforcement Learning by
Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, enabling human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
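The Variance-Aware Off-Policy Evaluation entry above describes reweighting Bellman residuals in fitted Q-iteration by an estimated variance. The sketch below illustrates that general idea for linear function approximation only; the function name, the squared-residual variance proxy, and all hyperparameters are illustrative assumptions and do not reproduce the cited VA-OPE procedure.

```python
import numpy as np

def variance_weighted_fqi(transitions, phi, gamma=0.99, n_iters=50, ridge=1e-3):
    """Rough sketch of variance-weighted fitted Q-iteration for off-policy
    evaluation with linear function approximation.

    transitions: list of (s, a, r, s_next, a_next_target) tuples, where
        a_next_target is the action the *target* policy takes at s_next.
    phi: feature map, phi(s, a) -> np.ndarray of shape (d,).

    Note: the squared-residual weighting below is a crude stand-in for a
    proper conditional-variance estimate; it is illustrative only.
    """
    d = phi(*transitions[0][:2]).shape[0]
    theta = np.zeros(d)
    weights = np.ones(len(transitions))  # start unweighted

    for _ in range(n_iters):
        X = np.stack([phi(s, a) for s, a, *_ in transitions])
        # Regression targets: one-step Bellman backup under the target policy.
        y = np.array([
            r + gamma * phi(s_next, a_next) @ theta
            for _, _, r, s_next, a_next in transitions
        ])
        # Weighted ridge regression: transitions whose Bellman residuals look
        # noisier get down-weighted.
        W = np.diag(weights)
        theta = np.linalg.solve(X.T @ W @ X + ridge * np.eye(d), X.T @ W @ y)

        # Update weights from squared residuals (placeholder variance proxy),
        # smoothed so that no single weight blows up.
        residuals = X @ theta - y
        weights = 1.0 / (residuals ** 2 + 1e-2)

    return theta  # Q(s, a) ~= phi(s, a) @ theta under the target policy
```

Given logged transitions paired with the target policy's next actions, averaging phi(s0, pi(s0)) @ theta over initial states yields the resulting estimate of the target policy's value.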
This list is automatically generated from the titles and abstracts of the papers on this site.