Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
- URL: http://arxiv.org/abs/2503.01134v1
- Date: Mon, 03 Mar 2025 03:29:05 GMT
- Title: Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
- Authors: Yuheng Zhang, Nan Jiang
- Abstract summary: We investigate off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. We prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings. We show that some of this hardness can be circumvented by a natural model-based algorithm.
- Score: 11.829110453985228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works of Uehara et al. (2023a); Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm -- whose analysis has surprisingly eluded the literature despite the algorithm's simplicity -- demonstrating provable separation between model-free and model-based OPE in POMDPs.
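The model-based route mentioned in the abstract can be pictured with a small sketch. The following is a minimal, illustrative sketch of the generic model-based OPE template in a partially observed environment, not the paper's specific estimator or analysis: fit a tabular model of observable dynamics, keyed on a truncated observation history, from trajectories logged by the behavior policy, then estimate the history-dependent target policy's value by Monte Carlo rollouts in the fitted model. The environment, the truncation length k, and both policies below are hypothetical placeholders chosen only to keep the example self-contained.

```python
from collections import defaultdict
import random

H, N_OBS, N_ACT = 4, 3, 2   # horizon, observation/action space sizes (hypothetical)
random.seed(0)

def behavior_policy(obs):
    """Memoryless behavior policy: uniform over actions."""
    return random.randrange(N_ACT)

def target_policy(history):
    """History-dependent target policy (hypothetical deterministic rule)."""
    return sum(history) % N_ACT

def true_env_step(obs, act):
    """Hypothetical ground-truth observable dynamics, unknown to the learner."""
    next_obs = (obs + act + random.choice([0, 1])) % N_OBS
    reward = 1.0 if next_obs == 0 else 0.0
    return next_obs, reward

def collect(num_traj=5000):
    """Step 1: log trajectories with the behavior policy."""
    data = []
    for _ in range(num_traj):
        obs, traj = 0, []
        for _ in range(H):
            act = behavior_policy(obs)
            nxt, rew = true_env_step(obs, act)
            traj.append((obs, act, rew, nxt))
            obs = nxt
        data.append(traj)
    return data

def fit_model(data, k=2):
    """Step 2: fit tabular next-observation counts and mean rewards,
    keyed on (last-k observations, current observation, action)."""
    counts = defaultdict(lambda: defaultdict(int))
    rew_sum = defaultdict(float)
    for traj in data:
        hist = []
        for obs, act, rew, nxt in traj:
            key = (tuple(hist[-k:]), obs, act)
            counts[key][nxt] += 1
            rew_sum[key] += rew
            hist.append(obs)
    return counts, rew_sum, k

def model_based_ope(model, num_rollouts=5000):
    """Step 3: Monte Carlo rollouts of the target policy inside the fitted model."""
    counts, rew_sum, k = model
    total = 0.0
    for _ in range(num_rollouts):
        obs, hist, ret = 0, [], 0.0
        for _ in range(H):
            act = target_policy(hist + [obs])
            key = (tuple(hist[-k:]), obs, act)
            if not counts[key]:          # key never visited: coverage failure
                break
            n = sum(counts[key].values())
            ret += rew_sum[key] / n      # plug-in mean reward for this key
            nxt = random.choices(list(counts[key]),
                                 weights=list(counts[key].values()))[0]
            hist.append(obs)
            obs = nxt
        total += ret
    return total / num_rollouts

print("model-based OPE estimate:", model_based_ope(fit_model(collect())))
```

The break on an unvisited key is where coverage assumptions of the kind discussed in the abstract enter: if the behavior data never reaches the histories the target policy induces, the plug-in model cannot be queried there.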
Related papers
- Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
This monograph explores various model-based and model-free approaches for constrained reinforcement learning within the context of average-reward Markov Decision Processes (MDPs).
The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
arXiv Detail & Related papers (2024-06-17T12:46:02Z) - RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation [73.2390735383842]
We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions.
We show how these can be used to derive near-optimal guarantees of an optimistic exploration algorithm.
These results can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
arXiv Detail & Related papers (2024-06-03T14:51:27Z) - GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP,
and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z) - Offline Reinforcement Learning with Instrumental Variables in Confounded
Markov Decision Processes [93.61202366677526]
We study the offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with the finite-sample suboptimality guarantee of finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Bayesian regularization of empirical MDPs [11.3458118258705]
We take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information.
We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large scale online shopping store.
arXiv Detail & Related papers (2022-08-03T22:02:50Z) - PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in the relevant problem parameters.
arXiv Detail & Related papers (2022-07-12T17:57:17Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - Counterfactual Learning of Stochastic Policies with Continuous Actions [42.903292639112536]
We introduce a modelling strategy based on a joint kernel embedding of contexts and actions. We empirically show that the optimization aspect of counterfactual learning is important. We propose an evaluation protocol for offline policies in real-world logged systems.
arXiv Detail & Related papers (2020-04-22T07:42:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.