Scalable Policy-Based RL Algorithms for POMDPs
- URL: http://arxiv.org/abs/2510.06540v2
- Date: Mon, 13 Oct 2025 22:52:44 GMT
- Title: Scalable Policy-Based RL Algorithms for POMDPs
- Authors: Ameya Anjarlekar, Rasoul Etesami, R. Srikant
- Abstract summary: We consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating a POMDP model by a finite-state Markov Decision Process (MDP). We show that the approximation error decreases exponentially with the length of the history used to define the MDP state. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
- Score: 6.2229686397601585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model by a finite-state Markov Decision Process (MDP), called the Superstate MDP. We first derive theoretical guarantees, improving upon prior work, that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by policy optimization by treating it as an MDP whose state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
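The sketch below is a minimal illustration (not the authors' code) of the idea in the abstract: the last H observation-action pairs are treated as a finite "superstate", and TD(0) with linear function approximation is used to evaluate a fixed policy on a toy POMDP. The environment, feature map, policy, and all hyperparameters are assumptions made purely for illustration; the policy-optimization step that the paper runs on top of this evaluation is omitted.

```python
import numpy as np

# Illustrative sketch only: treat the last H observation-action pairs as a
# "superstate" and run TD(0) with linear function approximation to evaluate a
# fixed policy on a toy POMDP. All quantities below are assumptions.

H = 3          # history length defining the superstate
GAMMA = 0.95   # discount factor
ALPHA = 0.05   # TD step size
N_OBS, N_ACT = 4, 2

rng = np.random.default_rng(0)

# Toy POMDP: a hidden Markov chain the agent never sees directly;
# it only receives a noisy observation of the hidden state.
N_HIDDEN = 4
P = rng.dirichlet(np.ones(N_HIDDEN), size=(N_HIDDEN, N_ACT))  # P[s, a] -> dist over s'
O = rng.dirichlet(np.ones(N_OBS), size=N_HIDDEN)              # O[s]    -> dist over obs
R = rng.uniform(0.0, 1.0, size=(N_HIDDEN, N_ACT))             # reward(s, a)

def features(history):
    """One-hot encode each (obs, action) pair in the length-H window."""
    phi = np.zeros(H * N_OBS * N_ACT)
    for i, (o, a) in enumerate(history):
        phi[i * N_OBS * N_ACT + o * N_ACT + a] = 1.0
    return phi

def policy(history):
    """Fixed behavior policy to evaluate; here simply uniform over actions."""
    return rng.integers(N_ACT)

# TD(0) policy evaluation on the superstate MDP.
w = np.zeros(H * N_OBS * N_ACT)
s = rng.integers(N_HIDDEN)
history = [(rng.integers(N_OBS), rng.integers(N_ACT)) for _ in range(H)]

for t in range(50_000):
    a = policy(history)
    r = R[s, a]
    s_next = rng.choice(N_HIDDEN, p=P[s, a])
    o_next = rng.choice(N_OBS, p=O[s_next])
    next_history = history[1:] + [(o_next, a)]

    phi, phi_next = features(history), features(next_history)
    td_error = r + GAMMA * w @ phi_next - w @ phi
    w += ALPHA * td_error * phi

    s, history = s_next, next_history

print("sample value estimate:", w @ features(history))
```

Per the abstract, increasing the history length H shrinks the approximation error exponentially, at the cost of a larger superstate space and feature dimension.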
Related papers
- Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability [29.63953552645502]
Multi-environment POMDPs (ME-POMDPs) extend standard POMDPs with discrete model uncertainty. We show that ME-POMDPs can be generalized to POMDPs with sets of initial beliefs. We then devise exact and approximate (point-based) algorithms to compute robust policies for AB-POMDPs.
arXiv Detail & Related papers (2025-10-27T18:24:11Z) - Sequential Monte Carlo for Policy Optimization in Continuous POMDPs [10.675160254513615]
We introduce a novel policy optimization framework for continuous partially observable Markov decision processes. Our method casts policy learning as probabilistic inference in a non-Markovian Feynman--Kac model. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks.
arXiv Detail & Related papers (2025-05-22T14:45:46Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence. We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair. This appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average-reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Recursively-Constrained Partially Observable Markov Decision Processes [13.8724466775267]
We show that C-POMDPs violate the optimal substructure property over successive decision steps.
Online re-planning in C-POMDPs is often ineffective due to the inconsistency resulting from this violation.
We introduce the Recursively-Constrained POMDP, which imposes additional history-dependent cost constraints on the C-POMDP.
arXiv Detail & Related papers (2023-10-15T00:25:07Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that these algorithms use.
arXiv Detail & Related papers (2022-10-10T21:11:55Z) - First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving robust Markov decision processes (MDPs).
The robust MDP involves a set of discounted, finite-state, finite-action-space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.