Near Optimality of Finite Memory Feedback Policies in Partially Observed
Markov Decision Processes
- URL: http://arxiv.org/abs/2010.07452v2
- Date: Sat, 8 Jan 2022 13:31:12 GMT
- Title: Near Optimality of Finite Memory Feedback Policies in Partially Observed
Markov Decision Processes
- Authors: Ali Devran Kara and Serdar Yuksel
- Abstract summary: We study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known.
We find optimal policies for the approximate belief model under mild non-linear filter stability conditions.
We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the theory of Partially Observed Markov Decision Processes (POMDPs),
the existence of optimal policies has in general been established by converting
the original partially observed stochastic control problem to a fully observed
one on the belief space, leading to a belief-MDP. However, computing an optimal
policy for this fully observed model, and so for the original POMDP, using
classical dynamic or linear programming methods is challenging even if the
original system has finite state and action spaces, since the state space of
the fully observed belief-MDP model is always uncountable. Furthermore, there
exist very few rigorous value function approximation and optimal policy
approximation results, as the regularity conditions needed often require a tedious
study involving the spaces of probability measures, leading to properties such
as Feller continuity. In this paper, we study a planning problem for POMDPs
where the system dynamics and measurement channel model are assumed to be
known. We construct an approximate belief model by discretizing the belief
space using only finite window information variables. We then find optimal
policies for the approximate model and we rigorously establish near optimality
of the constructed finite window control policies in POMDPs under mild
non-linear filter stability conditions and the assumption that the measurement
and action sets are finite (and the state space is real vector valued). We also
establish a rate of convergence result which relates the finite window memory
size and the approximation error bound, where the rate of convergence is
exponential under explicit and testable exponential filter stability
conditions. While there exist many experimental results and few rigorous
asymptotic convergence results, an explicit rate of convergence result is new
in the literature, to our knowledge.
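To make the construction concrete, the following is a minimal Python sketch (not the authors' code) of a finite-window approximate belief-MDP for a POMDP with finite state, action, and observation sets and a known model: every window of the last N (action, observation) pairs is treated as a state, its belief is obtained by running the nonlinear filter over that window only from a fixed reset prior, and the resulting finite MDP is solved by Q-value iteration. The finite-state restriction and all names below are illustrative assumptions; the paper itself allows real vector-valued state spaces.

    import itertools
    import numpy as np

    def window_belief(window, pi0, T, Obs):
        # Nonlinear filter run only over the window's (action, observation)
        # pairs, started from the fixed reset prior pi0.
        b = pi0.copy()
        for a, z in window:
            pred = b @ T[:, a, :]        # predict through the dynamics
            b = Obs[:, z] * pred         # correct with the observation likelihood
            b = b / b.sum() if b.sum() > 0 else pi0.copy()
        return b

    def finite_window_q_iteration(T, Obs, R, gamma, pi0, N, n_iters=500):
        # T: |S|x|A|x|S| transitions, Obs: |S|x|Z| observation kernel,
        # R: |S|x|A| rewards, gamma: discount factor, N: window length.
        nS, nA, _ = T.shape
        nZ = Obs.shape[1]
        windows = list(itertools.product(itertools.product(range(nA), range(nZ)), repeat=N))
        beliefs = {w: window_belief(w, pi0, T, Obs) for w in windows}
        Q = {w: np.zeros(nA) for w in windows}
        for _ in range(n_iters):
            Q_new = {}
            for w in windows:
                b = beliefs[w]
                q = np.zeros(nA)
                for a in range(nA):
                    r = b @ R[:, a]        # expected immediate reward under b
                    pred = b @ T[:, a, :]  # predicted next-state distribution
                    p_z = pred @ Obs       # induced observation probabilities
                    nxt = sum(p_z[z] * Q[w[1:] + ((a, z),)].max() for z in range(nZ))
                    q[a] = r + gamma * nxt
                Q_new[w] = q
            Q = Q_new
        return Q  # greedy finite-window policy: Q[w].argmax()

Under the paper's filter stability conditions, the value of the greedy policy extracted from such a finite-window Q-table approaches the optimal POMDP value as N grows, with the approximation error decaying exponentially in N under explicit exponential filter stability conditions.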
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL).
By incorporating this advantage gap function into the design of step size rules, we derive a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy.
This is the first time that such strong convergence properties have been established for policy gradient methods.
arXiv Detail & Related papers (2024-09-28T18:56:48Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Double Duality: Variational Primal-Dual Policy Optimization for
Constrained Reinforcement Learning [132.7040981721302]
We study the constrained convex Markov decision process (MDP), where the goal is to minimize a convex functional of the visitation measure.
Designing algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
arXiv Detail & Related papers (2024-02-16T16:35:18Z) - Online POMDP Planning with Anytime Deterministic Guarantees [11.157761902108692]
Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs).
Finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks.
We derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one.
arXiv Detail & Related papers (2023-10-03T04:40:38Z) - Optimal Learning via Moderate Deviations Theory [4.6930976245638245]
We develop a systematic construction of highly accurate confidence intervals by using a moderate deviation principle-based approach.
It is shown that the proposed confidence intervals are statistically optimal in the sense that they satisfy criteria regarding exponential accuracy, minimality, consistency, mischaracterization probability, and eventual uniformly most accurate (UMA) property.
arXiv Detail & Related papers (2023-05-23T19:57:57Z) - Computationally Efficient PAC RL in POMDPs with Latent Determinism and
Conditional Embeddings [97.12538243736705]
We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs).
Our algorithm provably scales to large-scale POMDPs.
arXiv Detail & Related papers (2022-06-24T05:13:35Z) - Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs [21.347689976296834]
We employ the natural policy gradient method to solve the discounted optimal control problem for constrained MDPs.
We also provide convergence and finite-sample guarantees for two sample-based NPG-PD algorithms.
arXiv Detail & Related papers (2022-06-06T04:28:04Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality
of Learned Policies under Filter Stability [0.0]
For POMDPs, we provide the convergence of a Q learning algorithm for control policies using a finite history of past observations and control actions.
We present explicit error bounds relating the approximation error to the length of the finite history window.
We show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP.
arXiv Detail & Related papers (2021-03-22T20:14:26Z) - Amortized Conditional Normalized Maximum Likelihood: Reliable Out of
Distribution Uncertainty Estimation [99.92568326314667]
We propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation.
Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle.
We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.
arXiv Detail & Related papers (2020-11-05T08:04:34Z)