A Spectral Approach to Off-Policy Evaluation for POMDPs
- URL: http://arxiv.org/abs/2109.10502v1
- Date: Wed, 22 Sep 2021 03:36:51 GMT
- Title: A Spectral Approach to Off-Policy Evaluation for POMDPs
- Authors: Yash Nair and Nan Jiang
- Abstract summary: We consider off-policy evaluation in Partially Observable Markov Decision Processes.
Prior work on this problem uses a causal identification strategy based on one-step observable proxies of the hidden state.
In this work, we relax this requirement by using spectral methods and extending one-step proxies both into the past and future.
- Score: 8.613667867961034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider off-policy evaluation (OPE) in Partially Observable Markov
Decision Processes, where the evaluation policy depends only on observable
variables but the behavior policy depends on latent states (Tennenholtz et al.
(2020a)). Prior work on this problem uses a causal identification strategy
based on one-step observable proxies of the hidden state, which relies on the
invertibility of certain one-step moment matrices. In this work, we relax this
requirement by using spectral methods and extending one-step proxies both into
the past and future. We empirically compare our OPE methods to existing ones
and demonstrate their improved prediction accuracy and greater generality.
Lastly, we derive a separate Importance Sampling (IS) algorithm which relies on
rank, distinctness, and positivity conditions, and not on the strict
sufficiency conditions of observable trajectories with respect to the reward
and hidden-state structure required by Tennenholtz et al. (2020a).
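As a hedged numerical illustration (not the authors' algorithm), the sketch below shows how a rank-truncated SVD pseudoinverse can stand in for the exact inversion of an empirical moment matrix between multi-step past and future proxy features; the feature matrices, their dimensions, and the chosen rank are placeholder assumptions.

```python
import numpy as np

def truncated_pinv(M, rank):
    """Rank-truncated pseudoinverse via SVD; usable when M is
    rectangular or (near-)rank-deficient, where exact inversion fails."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]
    return Vt.T @ np.diag(1.0 / s) @ U.T

def cross_moment(past_feats, future_feats):
    """Empirical cross-moment matrix E[phi(future) phi(past)^T],
    estimated from rows of logged (past, future) feature pairs."""
    return future_feats.T @ past_feats / past_feats.shape[0]

# Hypothetical usage with placeholder random features; in practice the rows
# would be multi-step history/future proxies built from logged trajectories.
rng = np.random.default_rng(0)
past = rng.random((1000, 12))    # k-step past features (illustrative shape)
future = rng.random((1000, 8))   # k-step future features (illustrative shape)
Sigma = cross_moment(past, future)
Sigma_dagger = truncated_pinv(Sigma, rank=4)  # rank ~ assumed latent-state dimension
```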
Related papers
- Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where, given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
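As a rough sketch of the conservative idea only (not HAMBO's actual hallucinated-adversary construction), one can penalize model rollouts by an epistemic-uncertainty term so that the return estimate is biased downward; the `model.step` interface and the penalty form below are placeholder assumptions.

```python
def pessimistic_return(model, policy, s0, horizon, beta, gamma=0.99):
    """Conservative model-based return estimate: roll out a learned dynamics
    model and subtract an uncertainty penalty from each reward, biasing the
    estimate downward.

    `model.step(s, a)` is a placeholder interface assumed to return
    (next_state, reward, std), where `std` reflects epistemic uncertainty.
    """
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r, std = model.step(s, a)
        ret += discount * (r - beta * std)   # penalize uncertain transitions
        discount *= gamma
    return ret
```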
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems [97.12538243736705]
We study Reinforcement Learning for partially observable dynamical systems using function approximation.
We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as POMDPs, observable Linear-Quadratic-Gaussian (LQG) systems, and Predictive State Representations (PSRs), as well as a newly introduced model, Hilbert Space Embeddings of POMDPs, and observable POMDPs with latent low-rank transitions.
arXiv Detail & Related papers (2022-06-24T00:27:42Z)
- A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes [31.215206208622728]
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs).
Existing methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces.
We first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
arXiv Detail & Related papers (2021-11-12T15:52:24Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Off-Policy Evaluation in Partially Observed Markov Decision Processes under Sequential Ignorability [8.388782503421504]
We consider off-policy evaluation of dynamic treatment rules under sequential ignorability.
We show that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes.
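For context, a standard baseline in this setting is importance sampling over observed histories; the sketch below is a generic per-decision IS estimator, not the paper's estimator, and it assumes the behavior policy's action probabilities are known or consistently estimable.

```python
import numpy as np

def per_decision_is(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-decision importance-sampling OPE estimate.

    Each trajectory is a list of (history, action, reward) tuples;
    pi_e(a, h) and pi_b(a, h) return action probabilities given the
    observed history. Knowing (or estimating) pi_b is an assumption
    of this sketch.
    """
    values = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for h, a, r in traj:
            weight *= pi_e(a, h) / pi_b(a, h)   # cumulative importance ratio
            ret += discount * weight * r
            discount *= gamma
        values.append(ret)
    return float(np.mean(values))
```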
arXiv Detail & Related papers (2021-10-24T03:35:23Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns.
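As an illustrative sketch only (not UnO itself), distributional quantities such as quantiles or CVaR can be read off an importance-weighted empirical CDF of trajectory returns; the self-normalized weighting below is a simplifying assumption.

```python
import numpy as np

def weighted_return_cdf(returns, is_weights):
    """Importance-weighted empirical CDF of returns.

    `returns[i]` is the return of logged trajectory i and `is_weights[i]`
    its full-trajectory importance ratio; quantiles, CVaR, etc. can be
    read off the resulting CDF (illustrative only).
    """
    returns = np.asarray(returns, dtype=float)
    w = np.asarray(is_weights, dtype=float)
    order = np.argsort(returns)
    returns, w = returns[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)          # self-normalized weights
    return returns, cdf

# Example: the 0.25-quantile is the smallest return whose CDF value >= 0.25.
rets, cdf = weighted_return_cdf([1.0, 3.0, 2.0, 0.5], [0.8, 1.2, 1.0, 0.6])
q25 = rets[np.searchsorted(cdf, 0.25)]
```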
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
- Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
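A common remedy, sketched here in a single-step (bandit-style) form with placeholder interfaces, is to smooth the deterministic target action with a kernel so that a finite "importance weight" exists; the Gaussian kernel, the known behavior density pi_b(a|s), and the fitted outcome model `q_hat` are assumptions of the sketch, not the paper's exact estimator.

```python
import numpy as np

def kernel_dr_value(data, target_policy, q_hat, bandwidth):
    """Kernel-smoothed doubly robust value estimate for a deterministic
    continuous-action policy (single-step sketch, placeholder interfaces).

    `data` holds tuples (s, a, r, b_density) with b_density = pi_b(a | s);
    `q_hat(s, a)` is a fitted outcome model; the Gaussian kernel replaces
    the ill-defined density ratio for the deterministic target action.
    """
    vals = []
    for s, a, r, b_density in data:
        a_star = target_policy(s)
        k = np.exp(-0.5 * ((a - a_star) / bandwidth) ** 2) / (
            bandwidth * np.sqrt(2 * np.pi))
        w = k / b_density                       # smoothed "importance weight"
        vals.append(q_hat(s, a_star) + w * (r - q_hat(s, a)))
    return float(np.mean(vals))
```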
arXiv Detail & Related papers (2020-06-06T15:52:05Z)