Near Optimality of Finite Memory Feedback Policies in Partially Observed
Markov Decision Processes
- URL: http://arxiv.org/abs/2010.07452v2
- Date: Sat, 8 Jan 2022 13:31:12 GMT
- Title: Near Optimality of Finite Memory Feedback Policies in Partially Observed
Markov Decision Processes
- Authors: Ali Devran Kara and Serdar Yuksel
- Abstract summary: We study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known.
We find optimal policies for an approximate belief model and establish their near optimality for the original POMDP under mild non-linear filter stability conditions.
We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the theory of Partially Observed Markov Decision Processes (POMDPs),
the existence of optimal policies has in general been established by converting
the original partially observed stochastic control problem to a fully observed
one on the belief space, leading to a belief-MDP. However, computing an optimal
policy for this fully observed model, and so for the original POMDP, using
classical dynamic or linear programming methods is challenging even if the
original system has finite state and action spaces, since the state space of
the fully observed belief-MDP model is always uncountable. Furthermore, there
exist very few rigorous value function and optimal policy approximation
results, as the regularity conditions needed often require a tedious study of
spaces of probability measures and of properties such as Feller continuity. In
this paper, we study a planning problem for POMDPs
where the system dynamics and measurement channel model are assumed to be
known. We construct an approximate belief model by discretizing the belief
space using only finite window information variables. We then find optimal
policies for the approximate model and we rigorously establish near optimality
of the constructed finite window control policies in POMDPs under mild
non-linear filter stability conditions and the assumption that the measurement
and action sets are finite (and the state space is real vector valued). We also
establish a rate of convergence result which relates the finite window memory
size and the approximation error bound, where the rate of convergence is
exponential under explicit and testable exponential filter stability
conditions. While there exist many experimental results and few rigorous
asymptotic convergence results, an explicit rate of convergence result is new
in the literature, to our knowledge.
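To make the finite-window construction concrete, here is a minimal sketch (assuming finite observation and action sets; the transition and reward arrays are random placeholders standing in for the known dynamics and measurement channel, so this is illustrative rather than the paper's exact quantization):

```python
# Minimal sketch of the finite-window idea: treat the last N
# (observation, action) pairs as the state of a finite, fully observed
# MDP and run Q-value iteration on that finite model. The model arrays
# below are random placeholders, not the paper's construction.
from itertools import product

import numpy as np

N = 2           # window length (finite memory size)
OBS = [0, 1]    # finite observation set, as assumed in the paper
ACTS = [0, 1]   # finite action set
GAMMA = 0.9     # discount factor

# Enumerate all length-N windows of (observation, action) pairs.
windows = list(product(product(OBS, ACTS), repeat=N))
S, A = len(windows), len(ACTS)

# Placeholder approximate model on window-states: P[s, a, s'] and r[s, a].
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))

# Q-value iteration on the finite approximate belief model.
Q = np.zeros((S, A))
for _ in range(500):
    Q = r + GAMMA * P @ Q.max(axis=1)

policy = Q.argmax(axis=1)   # a finite-window feedback policy
print("window", windows[0], "-> action", policy[0])
```

Once windows are treated as states, the approximate model is a finite MDP, so classical dynamic programming applies; the paper's contribution is bounding the gap between such a policy and the true POMDP optimum.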
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
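As a toy illustration of the practice this summary describes (not the paper's algorithm), the sketch below trains a linear-Gaussian policy with a score-function gradient and deploys only its mean; `SIGMA` plays the role of the exploration level being tuned:

```python
# Toy linear-Gaussian policy trained with a score-function (REINFORCE)
# gradient; only the mean -- the deterministic version -- is deployed.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # policy mean: a ~ N(theta @ s, SIGMA^2)
SIGMA = 0.5           # exploration level (the trade-off knob)
LR = 0.01

def reward(s, a):     # toy quadratic objective, purely for illustration
    return -(a - 1.5 * s[0]) ** 2

for _ in range(2000):
    s = rng.normal(size=2)
    a = theta @ s + SIGMA * rng.normal()                   # stochastic rollout
    grad = reward(s, a) * (a - theta @ s) / SIGMA**2 * s   # score-function gradient
    theta += LR * grad

def deployed_policy(s):
    return theta @ s  # deterministic policy used at deployment time
```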
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study evaluating a policy under best- and worst-case perturbations to a Markov decision process (MDP).
This is an important problem when there is the possibility of a shift between historical and future environments.
We propose a perturbation model that can modify transition kernel densities up to a given multiplicative factor or its reciprocal.
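A hedged sketch of what such a perturbation model means computationally: if each transition probability may be scaled by a factor in [1/lam, lam], the adversarial value of a single transition distribution can be found by a greedy reweighting (a toy version, not the paper's estimator):

```python
# Worst-case value of one transition under a multiplicative perturbation:
# each probability may be scaled by a factor in [1/lam, lam] (lam >= 1),
# and the adversary minimizes the expected next-state value. A greedy
# fractional-knapsack pass solves this exactly.
import numpy as np

def worst_case_value(p, v, lam):
    """Minimize q @ v over q = w * p with w in [1/lam, lam] and q summing to 1."""
    q = p / lam                       # start every weight at its lower bound
    budget = 1.0 - q.sum()            # probability mass still to assign
    for i in np.argsort(v):           # give extra mass to low-value states first
        extra = min(p[i] * lam - q[i], budget)
        q[i] += extra
        budget -= extra
    return float(q @ v)

p = np.array([0.5, 0.3, 0.2])           # nominal transition probabilities
v = np.array([1.0, 0.0, 2.0])           # next-state values
print(worst_case_value(p, v, lam=1.5))  # <= the nominal value p @ v
```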
arXiv Detail & Related papers (2024-03-29T18:11:49Z)
- On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average-reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
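The rate-to-regret translation in this summary is the standard summation argument; sketched in one line (with $C$ an assumed constant):

```latex
\text{If } J^\ast - J(\theta_t) \le \frac{C}{t}, \quad\text{then}\quad
\operatorname{Regret}(T) \;=\; \sum_{t=1}^{T}\bigl(J^\ast - J(\theta_t)\bigr)
\;\le\; C\sum_{t=1}^{T}\frac{1}{t} \;=\; O(\log T).
```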
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning [132.7040981721302]
We study the constrained convex Markov decision process (MDP), where the goal is to minimize a convex functional of the visitation measure.
Designing algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
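For intuition, a generic Lagrangian scheme for problems of this shape alternates a mirror-descent step on the measure with dual ascent on the multiplier; the toy stand-in below (negative entropy objective, one linear constraint, both invented for illustration) is not the paper's variational algorithm:

```python
# Toy primal-dual scheme on the simplex: minimize a convex functional f
# of the measure mu subject to g(mu) <= 0, alternating an exponentiated-
# gradient (mirror) step on mu with dual ascent on the multiplier.
import numpy as np

def f_grad(mu):  return np.log(mu + 1e-12) + 1.0   # gradient of neg-entropy
def g(mu):       return 0.3 - float(mu[0])         # toy constraint: mu[0] >= 0.3

mu = np.full(4, 0.25)   # measure over 4 state-action pairs
lam, LR = 0.0, 0.1

for _ in range(500):
    grad = f_grad(mu)
    grad[0] -= lam                    # add lam * gradient of g (= -1 at index 0)
    mu = mu * np.exp(-LR * grad)      # mirror-descent step
    mu /= mu.sum()                    # stay on the probability simplex
    lam = max(0.0, lam + LR * g(mu))  # dual ascent on the multiplier

print(mu, lam)   # mu[0] approaches 0.3; the rest stay near-uniform
```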
arXiv Detail & Related papers (2024-02-16T16:35:18Z)
- Online POMDP Planning with Anytime Deterministic Guarantees [11.157761902108692]
Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs).
Finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks.
We derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one.
arXiv Detail & Related papers (2023-10-03T04:40:38Z)
- Optimal Learning via Moderate Deviations Theory [4.6930976245638245]
We develop a systematic construction of highly accurate confidence intervals by using a moderate deviation principle-based approach.
It is shown that the proposed confidence intervals are statistically optimal in the sense that they satisfy criteria regarding exponential accuracy, minimality, consistency, mischaracterization probability, and eventual uniformly most accurate (UMA) property.
arXiv Detail & Related papers (2023-05-23T19:57:57Z)
- Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings [97.12538243736705]
We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs).
Our algorithm provably scales to large-scale POMDPs.
arXiv Detail & Related papers (2022-06-24T05:13:35Z)
- Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs [24.582720609592464]
We employ the natural policy gradient method to solve the discounted optimal control problem for constrained MDPs.
We also provide convergence and finite-sample guarantees for two sample-based NPG-PD algorithms.
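A toy single-state version of an NPG primal-dual iteration (illustrative only; the reward and utility vectors are invented): for a softmax policy, the natural-gradient step reduces to adding the Lagrangian payoff directly to the logits, while the dual step is projected subgradient descent on the multiplier.

```python
# Toy single-state NPG-PD iteration: primal natural-gradient step on the
# Lagrangian payoff r + lam * u, dual projected step on the multiplier.
import numpy as np

r = np.array([1.0, 0.4, 0.2])    # reward per action
u = np.array([-1.0, 0.5, 0.8])   # utility per action; constraint: E[u] >= 0
logits, lam, ETA = np.zeros(3), 0.0, 0.5

for _ in range(200):
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    logits += ETA * (r + lam * u)          # primal natural-gradient step
    lam = max(0.0, lam - ETA * (pi @ u))   # dual step enforcing E[u] >= 0

print(pi, lam)   # the policy shifts mass toward actions satisfying the constraint
```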
arXiv Detail & Related papers (2022-06-06T04:28:04Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
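For context, a minimal batch GPTD-style estimate (one fixed RBF kernel on made-up data, so it omits the paper's online updates and ensemble weighting) treats the Bellman identity as the observation model of a GP regression:

```python
# Minimal batch GPTD-style value estimate: model V ~ GP(0, k) and use
#   r_t = V(s_t) - gamma * V(s_{t+1}) + noise
# as the observation model, which makes inference plain GP regression.
import numpy as np

def k(a, b, ell=0.5):
    return np.exp(-((a - b) ** 2) / (2 * ell**2))

gamma, sigma2 = 0.9, 0.01
s = np.array([0.0, 0.2, 0.4, 0.6, 0.8])   # visited states s_0..s_4
r = np.array([0.1, 0.2, 0.3, 0.4])        # rewards for transitions t -> t+1

T = len(r)
H = np.zeros((T, len(s)))                 # maps V at states to expected rewards
for t in range(T):
    H[t, t], H[t, t + 1] = 1.0, -gamma

K = k(s[:, None], s[None, :])             # prior covariance of V at the states
A = H @ K @ H.T + sigma2 * np.eye(T)
V_post = K @ H.T @ np.linalg.solve(A, r)  # posterior mean of the value function
print(V_post)
```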
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability [0.0]
For POMDPs, we establish the convergence of a Q-learning algorithm for control policies that use a finite history of past observations and control actions.
We present explicit error bounds relating the approximation error to the length of the finite history window.
We show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP.
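A minimal sketch of such a finite-memory learner (the environment interface below is a hypothetical placeholder; the paper's results concern when this iteration converges and how good its limit is):

```python
# Q-learning over a finite history window: the agent's "state" is the
# tuple of its last N (observation, action) pairs.
from collections import defaultdict, deque
import random

N, GAMMA, LR, EPS = 2, 0.9, 0.1, 0.1
ACTS = [0, 1]

def env_step(a):
    # Placeholder POMDP: returns (reward, observation).
    return random.random(), random.choice([0, 1])

Q = defaultdict(float)
hist = deque([(0, 0)] * N, maxlen=N)   # window of (observation, action) pairs

for _ in range(10_000):
    s = tuple(hist)
    if random.random() < EPS:
        a = random.choice(ACTS)                # explore
    else:
        a = max(ACTS, key=lambda b: Q[s, b])   # exploit
    rew, obs = env_step(a)
    hist.append((obs, a))                      # slide the window
    s2 = tuple(hist)
    target = rew + GAMMA * max(Q[s2, b] for b in ACTS)
    Q[s, a] += LR * (target - Q[s, a])         # Q-learning update
```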
arXiv Detail & Related papers (2021-03-22T20:14:26Z)
- Amortized Conditional Normalized Maximum Likelihood: Reliable Out of Distribution Uncertainty Estimation [99.92568326314667]
We propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation.
Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle.
We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.
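A toy version of the underlying CNML computation that ACNML amortizes (using an off-the-shelf logistic regression as the model class and synthetic data, purely for illustration): refit once per candidate label with that label appended, then normalize the resulting likelihoods over labels.

```python
# Toy CNML: for a test input, refit the model with each candidate label
# appended and normalize the resulting likelihoods over labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)          # toy binary labels
x_test = np.array([[2.0, 0.0]])

scores = []
for label in (0, 1):
    Xa = np.vstack([X, x_test])        # append the test point ...
    ya = np.append(y, label)           # ... with the candidate label
    m = LogisticRegression().fit(Xa, ya)
    scores.append(m.predict_proba(x_test)[0, label])

cnml = np.array(scores) / sum(scores)  # normalize over candidate labels
print(cnml)
```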
arXiv Detail & Related papers (2020-11-05T08:04:34Z)