Safe Policy Improvement for POMDPs via Finite-State Controllers
- URL: http://arxiv.org/abs/2301.04939v1
- Date: Thu, 12 Jan 2023 11:22:54 GMT
- Title: Safe Policy Improvement for POMDPs via Finite-State Controllers
- Authors: Thiago D. Simão, Marnix Suilen, Nils Jansen
- Abstract summary: We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs).
SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner.
We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability.
- Score: 6.022036788651133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study safe policy improvement (SPI) for partially observable Markov
decision processes (POMDPs). SPI is an offline reinforcement learning (RL)
problem that assumes access to (1) historical data about an environment, and
(2) the so-called behavior policy that previously generated this data by
interacting with the environment. SPI methods neither require access to a model
nor the environment itself, and aim to reliably improve the behavior policy in
an offline manner. Existing methods make the strong assumption that the
environment is fully observable. In our novel approach to the SPI problem for
POMDPs, we assume that a finite-state controller (FSC) represents the behavior
policy and that finite memory is sufficient to derive optimal policies. This
assumption allows us to map the POMDP to a finite-state fully observable MDP,
the history MDP. We estimate this MDP by combining the historical data and the
memory of the FSC, and compute an improved policy using an off-the-shelf SPI
algorithm. The underlying SPI method constrains the policy-space according to
the available data, such that the newly computed policy only differs from the
behavior policy when sufficient data was available. We show that this new
policy, converted into a new FSC for the (unknown) POMDP, outperforms the
behavior policy with high probability. Experimental results on several
well-established benchmarks show the applicability of the approach, even in
cases where finite memory is not sufficient.
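The estimation step described in the abstract (combining the historical data with the memory of the FSC to obtain a finite MDP) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes trajectories are lists of (observation, action, reward) tuples, that the FSC memory update is a deterministic function, and that augmented states are (memory node, observation) pairs. The names `augment`, `estimate_mdp`, `update`, and `n_min` are hypothetical. Rewards are carried along but not estimated here.

```python
from collections import defaultdict


def augment(trajectory, update, init_node="start"):
    """Replay the (assumed deterministic) FSC memory along one trajectory.

    Returns a list of ((node, obs), action, reward) steps.
    """
    node, steps = init_node, []
    for obs, action, reward in trajectory:
        steps.append(((node, obs), action, reward))
        node = update(node, obs, action)
    return steps


def estimate_mdp(trajectories, update, n_min):
    """Maximum-likelihood transition estimates over augmented states.

    Also returns the state-action pairs seen fewer than n_min times,
    where a constrained SPI method would keep the behavior policy.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        steps = augment(traj, update)
        # The final step of a trajectory has no successor and is not counted.
        for (s, a, _), (s_next, _, _) in zip(steps, steps[1:]):
            counts[(s, a)][s_next] += 1

    probs, uncertain = {}, set()
    for sa, successors in counts.items():
        total = sum(successors.values())
        probs[sa] = {s2: c / total for s2, c in successors.items()}
        if total < n_min:
            uncertain.add(sa)
    return probs, uncertain


def update(node, obs, action):
    """Hypothetical memory update: remember only the last observation."""
    return obs


trajectories = [
    [("a", 0, 0.0), ("b", 1, 1.0), ("a", 0, 0.0)],
    [("a", 0, 0.0), ("b", 0, 0.0)],
]
probs, uncertain = estimate_mdp(trajectories, update, n_min=2)
```

In this toy dataset the pair (("start", "a"), 0) is observed twice and the pair (("a", "b"), 1) once, so with `n_min=2` only the latter is flagged as having insufficient data.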
Related papers
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- More for Less: Safe Policy Improvement With Stronger Performance Guarantees [7.507789621505201]
The safe policy improvement (SPI) problem aims to improve the performance of a behavior policy according to which sample data has been generated.
We present a novel approach to the SPI problem that provides the means to require less data for such guarantees.
arXiv Detail & Related papers (2023-05-13T16:22:21Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Robust Anytime Learning of Markov Decision Processes [8.799182983019557]
In data-driven applications, deriving precise probabilities from limited data introduces statistical errors.
Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions.
We propose a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies.
arXiv Detail & Related papers (2022-05-31T14:29:55Z)
- BATS: Best Action Trajectory Stitching [22.75880303352508]
We introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset.
We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics.
We show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
arXiv Detail & Related papers (2022-04-26T01:48:32Z)
- Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
- Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI algorithm for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs for different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
- PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.