Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
- URL: http://arxiv.org/abs/2207.13081v2
- Date: Tue, 14 Nov 2023 22:16:28 GMT
- Title: Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
- Authors: Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor
Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun
- Abstract summary: We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation.
We develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs.
We extend our methods to learning of dynamics and establish the connection between our approach and the well-known spectral learning methods in POMDPs.
- Score: 67.21319339512699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs)
with general function approximation. Existing methods such as sequential
importance sampling estimators and fitted-Q evaluation suffer from the curse of
horizon in POMDPs. To circumvent this problem, we develop a novel model-free
OPE method by introducing future-dependent value functions that take future
proxies as inputs. Future-dependent value functions play similar roles as
classical value functions in fully-observable MDPs. We derive a new Bellman
equation for future-dependent value functions as conditional moment equations
that use history proxies as instrumental variables. We further propose a
minimax learning method to learn future-dependent value functions using the new
Bellman equation. We obtain a PAC result, which implies that our OPE estimator is
consistent as long as futures and histories contain sufficient information
about latent states and Bellman completeness holds. Finally, we extend our
methods to learning of dynamics and establish the connection between our
approach and the well-known spectral learning methods in POMDPs.
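A minimal sketch of the construction described above, in illustrative notation (not taken verbatim from the paper): let F denote a future proxy, H a history proxy, and b a future-dependent value function chosen so that its conditional mean given the latent state equals the evaluation policy's value. The new Bellman equation can then be read as a conditional moment restriction

    E[ \mu(O, A) \{ R + \gamma b(F') \} - b(F) \mid H ] = 0,

where \mu is an importance ratio between the evaluation and behavior policies and the history proxy H plays the role of the instrumental variable. The minimax learning step would correspondingly solve something like

    \hat{b} \in \arg\min_{b \in \mathcal{B}} \max_{\xi \in \Xi} E_n[ \{ \mu(O, A) ( R + \gamma b(F') ) - b(F) \} \xi(H) ] - \lambda E_n[ \xi(H)^2 ],

with \xi ranging over a class of test functions of the history proxy; the policy value would then be estimated by averaging \hat{b} over futures drawn at the initial time step.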
Related papers
- On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation [11.829110453985228]
We develop estimators whose guarantee avoids exponential dependence on the horizon.
In this paper, we discover novel coverage assumptions tailored to the structure of POMDPs.
As a side product, our analyses also lead to the discovery of new algorithms with complementary properties.
arXiv Detail & Related papers (2024-02-22T17:00:50Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems [97.12538243736705]
We study Reinforcement Learning for partially observable dynamical systems using function approximation.
We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as POMDPs, observable Linear-Quadratic-Gaussian (LQG) systems, Predictive State Representations (PSRs), as well as a newly introduced model, Hilbert Space Embeddings of POMDPs, and observable POMDPs with latent low-rank transition.
arXiv Detail & Related papers (2022-06-24T00:27:42Z)
- A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes [31.215206208622728]
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs).
Existing methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces.
We first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
arXiv Detail & Related papers (2021-11-12T15:52:24Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- A maximum-entropy approach to off-policy evaluation in average-reward MDPs [54.967872716145656]
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs).
We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases.
We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
arXiv Detail & Related papers (2020-06-17T18:13:37Z)