Off-Policy Evaluation for Episodic Partially Observable Markov Decision
Processes under Non-Parametric Models
- URL: http://arxiv.org/abs/2209.10064v1
- Date: Wed, 21 Sep 2022 01:44:45 GMT
- Title: Off-Policy Evaluation for Episodic Partially Observable Markov Decision
Processes under Non-Parametric Models
- Authors: Rui Miao, Zhengling Qi, Xiaoke Zhang
- Abstract summary: We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states.
Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value.
This is the first finite-sample error bound for OPE in POMDPs under non-parametric models.
- Score: 2.3411358616430435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of off-policy evaluation (OPE) for episodic Partially
Observable Markov Decision Processes (POMDPs) with continuous states. Motivated
by the recently proposed proximal causal inference framework, we develop a
non-parametric identification result for estimating the policy value via a
sequence of so-called V-bridge functions with the help of time-dependent proxy
variables. We then develop a fitted-Q-evaluation-type algorithm to estimate
V-bridge functions recursively, where a non-parametric instrumental variable
(NPIV) problem is solved at each step. By analyzing this challenging sequential
NPIV problem, we establish the finite-sample error bounds for estimating the
V-bridge functions and accordingly that for evaluating the policy value, in
terms of the sample size, length of horizon and so-called (local) measure of
ill-posedness at each step. To the best of our knowledge, this is the first
finite-sample error bound for OPE in POMDPs under non-parametric models.
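As a rough illustration of the estimation scheme the abstract describes, the sketch below runs a fitted-Q-evaluation-style backward recursion in which each step solves a linear-in-features NPIV problem by two-stage least squares. The feature maps `phi` and `psi`, the data layout, and the ridge regularization are illustrative assumptions, not the paper's actual estimator.

```python
# Minimal sketch: backward fitted-Q-evaluation recursion where each step
# solves an NPIV problem with time-dependent proxies as instruments.
import numpy as np

def npiv_2sls(phi_w, psi_z, y, reg=1e-3):
    """Solve the moment restriction E[phi(W) @ beta - y | Z] = 0 by
    projecting both sides onto instrument features psi(Z) (2SLS)."""
    A = psi_z.T @ psi_z + reg * np.eye(psi_z.shape[1])
    proj_phi = psi_z @ np.linalg.solve(A, psi_z.T @ phi_w)  # ~ E[phi(W) | Z]
    proj_y = psi_z @ np.linalg.solve(A, psi_z.T @ y)        # ~ E[y | Z]
    B = proj_phi.T @ proj_phi + reg * np.eye(proj_phi.shape[1])
    return np.linalg.solve(B, proj_phi.T @ proj_y)

def fitted_v_bridge(episodes, phi, psi, reg=1e-3):
    """Backward recursion over the horizon: at each step t, regress the
    pseudo-outcome (reward + next-step bridge value) on proxy features.

    episodes[t] = (W_t, Z_t, R_t, W_next), each an array with n rows."""
    horizon = len(episodes)
    betas = [None] * horizon
    next_value = lambda W: np.zeros(len(W))        # terminal bridge is zero
    for t in reversed(range(horizon)):
        W, Z, R, W_next = episodes[t]
        y = R + next_value(W_next)                 # pseudo-outcome
        betas[t] = npiv_2sls(phi(W), psi(Z), y, reg)
        b = betas[t]
        next_value = lambda W, b=b: phi(W) @ b     # fitted V-bridge at step t
    return betas
```

The policy value would then be estimated by averaging the step-1 bridge function over initial proxies, mirroring the recursive construction in the abstract.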
Related papers
- Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator also permits statistical inference via Wald confidence intervals.
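A minimal sketch of the Wald interval construction, assuming access to i.i.d. influence-function values for the policy-value estimator; the influence function itself is estimator-specific and not reproduced here.

```python
# Wald confidence interval from influence-function values (assumed given).
import numpy as np
from scipy import stats

def wald_ci(influence_values, alpha=0.05):
    """Point estimate +/- z * standard error, from asymptotic normality."""
    n = len(influence_values)
    est = float(np.mean(influence_values))
    se = float(np.std(influence_values, ddof=1)) / np.sqrt(n)
    z = stats.norm.ppf(1 - alpha / 2)
    return est - z * se, est + z * se
```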
arXiv Detail & Related papers (2024-03-29T18:11:49Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
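Schematically, an uncertainty Bellman equation propagates a local uncertainty signal the way a reward propagates in the ordinary Bellman equation. The template below is generic; the cited paper's specific local term w, which makes the fixed point converge to the posterior variance of values, differs in detail.

```latex
% Generic template of an uncertainty Bellman equation (UBE): the local
% uncertainty w(s,a) plays the role of a reward.
U^{\pi}(s,a) = w(s,a) + \gamma^{2}\,
  \mathbb{E}_{s' \sim P(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')}
  \bigl[ U^{\pi}(s',a') \bigr]
```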
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
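As a toy illustration of bonus-driven optimism, here is a generic count-based UCB bonus; the paper's PSR-specific bonus, which upper bounds the total variation distance between estimated and true models, is more involved, so the functional form and constants below are assumptions.

```python
# Generic count-based optimism bonus: shrinks as data accumulates.
import math

def ucb_bonus(visit_count, total_steps, scale=1.0, delta=0.05):
    """Confidence-width bonus of order sqrt(log(T/delta) / n)."""
    return scale * math.sqrt(math.log(total_steps / delta) / max(visit_count, 1))
```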
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
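The setting these bounds concern is standard policy evaluation with linear function approximation; a minimal TD(0) loop in that setting looks as follows (step size, features, and data interface are illustrative assumptions).

```python
# TD(0) with linear function approximation: semi-gradient updates.
import numpy as np

def td0_linear(transitions, feature_dim, gamma=0.99, lr=0.05):
    """transitions: iterable of (phi_s, reward, phi_s_next) feature tuples."""
    theta = np.zeros(feature_dim)
    for phi_s, r, phi_next in transitions:
        td_error = r + gamma * phi_next @ theta - phi_s @ theta
        theta += lr * td_error * phi_s   # semi-gradient step
    return theta
```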
arXiv Detail & Related papers (2023-05-30T12:58:39Z)
- A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes [31.215206208622728]
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs).
Existing methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces.
We first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
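Schematically, such bridge functions are learned from a conditional moment restriction via a minimax criterion of the following form, where f ranges over a class of test functions; the exact moments, function classes, and regularization in the cited paper differ in detail.

```latex
% Schematic minimax estimator for a bridge function b from the moment
% restriction E[ b(W) - Y | Z ] = 0; f is an adversarial test function.
\hat{b} = \arg\min_{b \in \mathcal{B}} \; \max_{f \in \mathcal{F}}
  \; \mathbb{E}_{n}\!\bigl[ f(Z)\,( b(W) - Y ) \bigr]
  \; - \; \lambda\,\| f \|_{\mathcal{F}}^{2}
```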
arXiv Detail & Related papers (2021-11-12T15:52:24Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Nonparametric estimation of continuous DPPs with kernel methods [0.0]
Parametric and nonparametric inference methods have been proposed in the finite case, i.e. when the point patterns live in a finite ground set.
We show that a restricted version of the continuous-case maximum likelihood estimation (MLE) problem falls within the scope of a recent representer theorem for nonnegative functions in an RKHS.
We propose, analyze, and demonstrate a fixed point algorithm to solve this finite-dimensional problem.
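A generic fixed-point iteration of the kind used to solve such a finite-dimensional problem is sketched below; the specific update map derived from the DPP likelihood is paper-specific, so `update` here is a placeholder.

```python
# Generic fixed-point solver: iterate x <- update(x) until convergence.
import numpy as np

def fixed_point(update, x0, tol=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = update(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```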
arXiv Detail & Related papers (2021-06-27T11:57:14Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
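A rough sketch of a variance-reweighted regression step of the kind described, using inverse-variance weights inside a ridge regression; the variance estimator and exact weighting rule in VA-OPE are simplified away here and are assumptions.

```python
# One weighted regression step of Fitted Q-Iteration: downweight
# high-variance Bellman targets via inverse-variance weights.
import numpy as np

def weighted_fqi_step(phi, targets, var_est, reg=1e-3):
    """phi: (n, p) features; targets: (n,) Bellman targets; var_est: (n,)."""
    w = 1.0 / np.maximum(var_est, 1e-6)
    A = phi.T @ (w[:, None] * phi) + reg * np.eye(phi.shape[1])
    return np.linalg.solve(A, phi.T @ (w * targets))
```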
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Identification of Unexpected Decisions in Partially Observable Monte-Carlo Planning: a Rule-Based Approach [78.05638156687343]
We propose a methodology for analyzing POMCP policies by inspecting their traces.
The proposed method explores local properties of policy behavior to identify unexpected decisions.
We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to mobile robot navigation.
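As a toy illustration of rule-based trace inspection, the helper below flags steps of a trace that violate a user-supplied local property; the rule language and trace format in the cited paper are richer, and both are assumptions here.

```python
# Flag trace steps that violate a local rule on (belief, action) pairs.
def flag_unexpected(trace, rule):
    """trace: list of (belief, action) pairs;
    rule(belief, action) -> True if the decision is expected."""
    return [(i, step) for i, step in enumerate(trace) if not rule(*step)]
```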
arXiv Detail & Related papers (2020-12-23T15:09:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.