Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
- URL: http://arxiv.org/abs/2505.09518v2
- Date: Fri, 20 Jun 2025 11:24:24 GMT
- Title: Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
- Authors: Maris F. L. Galesloot, Roman Andriushchenko, Milan Češka, Sebastian Junges, Nils Jansen
- Abstract summary: A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We show that our approach produces policies that are more robust and generalize better to unseen POMDPs.
- Score: 7.447371788025412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
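To make the two-step structure concrete, here is a minimal Python sketch of the robust policy-gradient loop described in the abstract. It is not the paper's implementation: the deductive verification step is replaced by a Monte-Carlo rollout estimate of each candidate POMDP's value, and the finite-state controller is reduced to a memoryless observation-based policy (an FSC with a single memory node). All class and function names are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

class TabularPOMDP:
    """Tiny tabular POMDP with deterministic observations (illustrative only)."""
    def __init__(self, T, R, obs_of_state, gamma=0.95):
        self.T, self.R, self.obs_of_state, self.gamma = T, R, obs_of_state, gamma
        self.nS, self.nA, _ = T.shape

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout_return(pomdp, theta, horizon=40):
    """Monte-Carlo estimate of the discounted return of the policy softmax(theta[obs])."""
    s, ret, disc = 0, 0.0, 1.0
    for _ in range(horizon):
        o = pomdp.obs_of_state[s]
        a = rng.choice(pomdp.nA, p=softmax(theta[o]))
        ret += disc * pomdp.R[s, a]
        disc *= pomdp.gamma
        s = rng.choice(pomdp.nS, p=pomdp.T[s, a])
    return ret

def reinforce_grad(pomdp, theta, horizon=40):
    """Single-rollout REINFORCE (likelihood-ratio) estimate of the policy gradient."""
    grad = np.zeros_like(theta)
    s, ret, disc = 0, 0.0, 1.0
    for _ in range(horizon):
        o = pomdp.obs_of_state[s]
        probs = softmax(theta[o])
        a = rng.choice(pomdp.nA, p=probs)
        grad[o] -= probs          # accumulate d log pi(a|o) / d theta = onehot(a) - probs
        grad[o, a] += 1.0
        ret += disc * pomdp.R[s, a]
        disc *= pomdp.gamma
        s = rng.choice(pomdp.nS, p=pomdp.T[s, a])
    return ret * grad             # return-weighted score function

def robust_policy_search(pomdps, n_obs, n_actions, iters=500, lr=0.05, n_eval=64):
    theta = np.zeros((n_obs, n_actions))
    for _ in range(iters):
        # (1) robust evaluation: pick the (estimated) worst-case POMDP for theta
        values = [np.mean([rollout_return(m, theta) for _ in range(n_eval)]) for m in pomdps]
        worst = pomdps[int(np.argmin(values))]
        # (2) subgradient ascent on the worst-case POMDP
        theta += lr * np.mean([reinforce_grad(worst, theta) for _ in range(8)], axis=0)
    return theta
```

The loop alternates between selecting the (estimated) worst-case POMDP for the current parameters and taking a REINFORCE-style subgradient step on that model, mirroring the robust evaluation/optimization split described above.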
Related papers
- Efficient Solution and Learning of Robust Factored MDPs [57.2416302384766]
Learning r-MDPs from interactions with an unknown environment enables the synthesis of robust policies with provable guarantees on performance. We propose novel methods for solving and learning r-MDPs based on factored state representations.
arXiv Detail & Related papers (2025-08-01T15:23:15Z)
- Efficient Strategy Synthesis for MDPs via Hierarchical Block Decomposition [47.123254940289726]
Software product lines and robotics utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional policy synthesis methods, they fail to scale to large state spaces. Our approach addresses this issue and accelerates policy synthesis in large MDPs by dynamically refining the MDP and iteratively selecting the most fragile MDP regions for refinement.
arXiv Detail & Related papers (2025-06-21T19:03:03Z)
- Pessimistic Iterative Planning for Robust POMDPs [33.73695799565586]
We propose a pessimistic iterative planning (PIP) framework to compute robust memory-based POMDP policies.
Within PIP, we propose the rFSCNet algorithm.
In each iteration, rFSCNet finds an FSC through a recurrent neural network by using supervision policies optimized for the pessimistic POMDP.
arXiv Detail & Related papers (2024-08-16T14:25:20Z)
- Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse interval MDPs (IMDPs) for a set of unknown sample environments induced by parameters. We show that our approach produces tight bounds on a policy's performance with high confidence.
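As a rough illustration of the kind of object involved, the sketch below builds an interval MDP from the transition matrices of sampled environments and evaluates a worst-case expectation over the resulting intervals. The paper derives statistically sound intervals and confidence bounds; the plain elementwise min/max used here only shows the data structure, and all names are illustrative.

```python
import numpy as np

def imdp_from_samples(sampled_T):
    """sampled_T: (n_models, nS, nA, nS) transition probabilities of sampled environments.
    Returns elementwise lower/upper transition bounds forming an interval MDP."""
    return sampled_T.min(axis=0), sampled_T.max(axis=0)

def worst_case_expectation(p_low, p_up, v):
    """Worst-case expectation of v over distributions within [p_low, p_up]:
    start from the lower bounds and push the remaining mass toward low-value successors."""
    order = np.argsort(v)                 # successors sorted by increasing value
    p = p_low.astype(float).copy()
    slack = 1.0 - p.sum()                 # probability mass still to be placed
    for i in order:
        add = min(slack, p_up[i] - p_low[i])
        p[i] += add
        slack -= add
        if slack <= 1e-12:
            break
    return float(p @ v)
```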
arXiv Detail & Related papers (2024-08-06T10:48:15Z)
- Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that these algorithms use.
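For readers unfamiliar with the approximation being analysed, a minimal bootstrap particle-filter belief update looks roughly as follows. The transition and observation-likelihood functions are illustrative stand-ins for a concrete POMDP model, not an interface from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_belief_update(particles, action, observation,
                           transition_sample, obs_likelihood):
    """One bootstrap-filter step: propagate, reweight by the observation, resample."""
    # 1. propagate each particle through the (stochastic) transition model
    propagated = np.array([transition_sample(s, action) for s in particles])
    # 2. weight particles by the likelihood of the received observation
    weights = np.array([obs_likelihood(observation, s, action) for s in propagated])
    if weights.sum() == 0.0:              # particle depletion: fall back to uniform weights
        weights = np.ones(len(propagated))
    weights = weights / weights.sum()
    # 3. resample to return an equally weighted approximation of the posterior belief
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]
```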
arXiv Detail & Related papers (2022-10-10T21:11:55Z)
- Bayesian regularization of empirical MDPs [11.3458118258705]
We take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information.
We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large scale online shopping store.
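A generic example of regularizing an empirical MDP with prior information is Dirichlet smoothing of the maximum-likelihood transition estimates, sketched below. This is only an illustration of the Bayesian-regularization idea, not the paper's specific algorithm; the function name and signature are invented.

```python
import numpy as np

def regularized_transitions(counts, prior_T, alpha=1.0):
    """counts: (nS, nA, nS) empirical visit counts; prior_T: prior transition model
    whose rows sum to 1. Returns the posterior-mean transition model under a
    Dirichlet(alpha * prior_T) prior, which smooths the empirical MDP toward the prior."""
    posterior = counts + alpha * prior_T
    return posterior / posterior.sum(axis=-1, keepdims=True)
```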
arXiv Detail & Related papers (2022-08-03T22:02:50Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
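The object being made efficient here is the robust Bellman backup, which requires an inner minimization over the uncertainty set for every state-action pair. The sketch below writes out a generic (s,a)-rectangular $L_1$ instance of that inner problem explicitly, solving it as a small linear program; the paper's contribution is an equivalent regularized operator that avoids such inner optimizations. Assumes NumPy and SciPy; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_value(p0, v, delta):
    """min_p p @ v  s.t.  p in the simplex and ||p - p0||_1 <= delta.
    Variables are [p, t] with t >= |p - p0| enforced by linear constraints."""
    n = len(p0)
    c = np.concatenate([v, np.zeros(n)])
    I = np.eye(n)
    A_ub = np.block([[I, -I],                              # p - t <= p0
                     [-I, -I],                             # -p - t <= -p0
                     [np.zeros((1, n)), np.ones((1, n))]]) # sum(t) <= delta
    b_ub = np.concatenate([p0, -p0, [delta]])
    A_eq = np.concatenate([np.ones(n), np.zeros(n)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
    return res.fun

def robust_value_iteration(T0, R, delta, gamma=0.95, iters=100):
    """Value iteration with a worst-case kernel chosen independently per (s, a)."""
    nS, nA, _ = T0.shape
    V = np.zeros(nS)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * worst_case_value(T0[s, a], V, delta)
                       for a in range(nA)] for s in range(nS)])
        V = Q.max(axis=1)
    return V
```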
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
- Robust Entropy-regularized Markov Decision Processes [23.719568076996662]
We study a robust version of the entropy-regularized MDP (ER-MDP) model, where the optimal policies are required to be robust.
We show that essential properties that hold for the non-robust ER-MDP and robust unregularized MDP models also hold in our settings.
We show how our framework and results can be integrated into different algorithmic schemes, including value or (modified) policy iteration.
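For reference, the (non-robust) entropy-regularized Bellman backup at the core of the ER-MDP model is a soft-max over Q-values; the robust variant studied here replaces the nominal expectation with a worst-case one over an uncertainty set. A small sketch, with illustrative names:

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_backup(P, R, V, gamma=0.95, tau=0.1):
    """Entropy-regularized backup: V(s) = tau * log sum_a exp(Q(s, a) / tau),
    with Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    return tau * logsumexp(Q / tau, axis=1)
```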
arXiv Detail & Related papers (2021-12-31T09:50:46Z)
- Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding constrained MDP (CMDP).
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
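A hedged sketch of the early-termination construction: wrapping a constrained task so that any step exceeding the cost budget ends the episode, which yields an ordinary unconstrained MDP to solve. The reset/step interface and the "cost" entry in info are illustrative conventions, not an API from the paper.

```python
class EarlyTerminatedEnv:
    """Wraps a constrained environment so that exceeding the cost budget ends the episode."""
    def __init__(self, env, cost_limit=0.0):
        self.env = env                    # underlying CMDP-style environment
        self.cost_limit = cost_limit
        self._cumulative_cost = 0.0

    def reset(self):
        self._cumulative_cost = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._cumulative_cost += info.get("cost", 0.0)
        if self._cumulative_cost > self.cost_limit:
            done = True                   # terminate early instead of accumulating more cost
        return obs, reward, done, info
```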
arXiv Detail & Related papers (2021-07-09T04:24:40Z)
- PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst-case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
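As a rough illustration of the policy-cover idea, the sketch below aggregates state-visitation counts over the ensemble of previously learned policies and turns poor coverage into an exploration bonus for training the next policy. The rollout interface and the count-based bonus form are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def cover_visitation_counts(cover, rollout_fn, n_states, n_rollouts=100):
    """Aggregate state-visitation counts over all policies in the cover.
    rollout_fn(policy) should return the list of state indices visited in one episode."""
    counts = np.zeros(n_states)
    for policy in cover:
        for _ in range(n_rollouts):
            for s in rollout_fn(policy):
                counts[s] += 1
    return counts

def exploration_bonus(counts, scale=1.0):
    """Count-based bonus: large in states the current cover rarely reaches."""
    return scale / np.sqrt(counts + 1.0)
```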
arXiv Detail & Related papers (2020-07-16T16:57:41Z)