Related papers: Safe Policy Improvement for POMDPs via Finite-State Controllers

Safe Policy Improvement for POMDPs via Finite-State Controllers

URL: http://arxiv.org/abs/2301.04939v1
Date: Thu, 12 Jan 2023 11:22:54 GMT
Title: Safe Policy Improvement for POMDPs via Finite-State Controllers
Authors: Thiago D. Sim\~ao, Marnix Suilen, Nils Jansen
Abstract summary: We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs) SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability.
Score: 6.022036788651133
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.

Related papers

Data-Efficient Safe Policy Improvement Using Parametric Structure [6.914228980072897]
We make safe policy improvement (SPI) more data-efficient through three contributions.<n>A parametric SPI algorithm exploits known correlations between distributions to more accurately estimate the transition dynamics.<n>A more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, can identify more actions to prune.
arXiv Detail & Related papers (2025-07-21T12:00:03Z)
Decision-Point Guided Safe Policy Improvement [22.885394395400592]
Decision Points RL (DPRL) is an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states while still utilizing data from sparsely visited states.
arXiv Detail & Related papers (2024-10-12T04:05:56Z)
Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse IMDPs for a set of unknown sample environments induced by parameters. We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z)
Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error. In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
Robust Anytime Learning of Markov Decision Processes [8.799182983019557]
In data-driven applications, deriving precise probabilities from limited data introduces statistical errors. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions. We propose a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies.
arXiv Detail & Related papers (2022-05-31T14:29:55Z)
Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation. By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z)
BATS: Best Action Trajectory Stitching [22.75880303352508]
We introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics. We show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
arXiv Detail & Related papers (2022-04-26T01:48:32Z)
Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ. We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments. We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of Early TerminatedP (ET-MDP) We first define the ET-MDP as an unconstrained algorithm with the same optimal value function as its corresponding CMDP. An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z)
Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting. We present an SPI for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs for different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.