Iterative Bounding MDPs: Learning Interpretable Policies via
Non-Interpretable Methods
- URL: http://arxiv.org/abs/2102.13045v1
- Date: Thu, 25 Feb 2021 17:55:15 GMT
- Title: Iterative Bounding MDPs: Learning Interpretable Policies via
Non-Interpretable Methods
- Authors: Nicholay Topin, Stephanie Milani, Fei Fang, Manuela Veloso
- Abstract summary: Current work in explainable reinforcement learning generally produces policies in the form of a decision tree over the state space.
We propose a novel Markov Decision Process (MDP) type for learning decision tree policies: Iterative Bounding MDPs (IBMDPs).
IBMDPs are constructed around a base MDP so each IBMDP policy is guaranteed to correspond to a decision tree policy for the base MDP when using a method-agnostic masking procedure.
- Score: 37.553463240629135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current work in explainable reinforcement learning generally produces
policies in the form of a decision tree over the state space. Such policies can
be used for formal safety verification, agent behavior prediction, and manual
inspection of important features. However, existing approaches fit a decision
tree after training or use a custom learning procedure which is not compatible
with new learning techniques, such as those which use neural networks. To
address this limitation, we propose a novel Markov Decision Process (MDP) type
for learning decision tree policies: Iterative Bounding MDPs (IBMDPs). An IBMDP
is constructed around a base MDP so each IBMDP policy is guaranteed to
correspond to a decision tree policy for the base MDP when using a
method-agnostic masking procedure. Because of this decision tree equivalence,
any function approximator can be used during training, including a neural
network, while yielding a decision tree policy for the base MDP. We present the
required masking procedure as well as a modified value update step which allows
IBMDPs to be solved using existing algorithms. We apply this procedure to
produce IBMDP variants of recent reinforcement learning methods. We empirically
show the benefits of our approach by solving IBMDPs to produce decision tree
policies for the base MDPs.
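To make the construction concrete, below is a minimal, illustrative sketch of an IBMDP-style wrapper. The interface, the normalized-feature assumption, and the per-query penalty are assumptions for illustration; the paper's exact formulation (including its modified value update step) is not reproduced here.
```python
import numpy as np

class IterativeBoundingMDP:
    """Minimal, illustrative IBMDP-style wrapper (assumed interface;
    not the authors' implementation). The agent alternates between
    information-gathering actions, which refine per-feature bounds on
    the current base state, and base actions, which act in the base MDP.
    """

    def __init__(self, base_env, n_features, info_penalty=-0.01):
        self.base_env = base_env          # base MDP with reset()/step()
        self.n_features = n_features      # features assumed normalized to [0, 1]
        self.info_penalty = info_penalty  # small cost per bounding query
        self.state = None

    def reset(self):
        self.state = self.base_env.reset()
        self._reset_bounds()
        return self._masked_obs()

    def _reset_bounds(self):
        self.lower = np.zeros(self.n_features)
        self.upper = np.ones(self.n_features)

    def _masked_obs(self):
        # Masking: the policy observes only the current bounds, never the
        # raw base state, so its choices depend on the bounds alone and
        # can be read off as a decision tree over the base features.
        return np.concatenate([self.lower, self.upper])

    def step(self, action):
        if isinstance(action, tuple):     # info action: (feature, threshold)
            feature, threshold = action
            if self.state[feature] >= threshold:
                self.lower[feature] = max(self.lower[feature], threshold)
            else:
                self.upper[feature] = min(self.upper[feature], threshold)
            return self._masked_obs(), self.info_penalty, False, {}
        # Base action: act in the underlying MDP and restart the bounding.
        self.state, reward, done, info = self.base_env.step(action)
        self._reset_bounds()
        return self._masked_obs(), reward, done, info
```
Under this masking, an IBMDP policy traces a root-to-leaf path for every base state: each information-gathering action corresponds to an internal split node and each base action to a leaf, which is the decision tree equivalence the abstract describes.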
Related papers
- MDP Geometry, Normalization and Value Free Solvers [15.627546283580166]
We present a new geometric interpretation of MDPs, which is useful for analyzing the dynamics of the main MDP algorithms.
We show that MDPs can be split into equivalence classes with indistinguishable algorithm dynamics.
arXiv Detail & Related papers (2024-07-09T09:39:45Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
A sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a last-iterate guarantee of a near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
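For intuition only, a generic UCB-style rule is sketched below. The paper's actual bonus term is PSR-specific and upper bounds the total variation distance between the estimated and true models, so this standard bandit form is a stand-in rather than the paper's construction.
```latex
% Generic UCB1-style rule for illustration; the paper's PSR-specific
% bonus (bounding a total variation distance) is more involved.
\mathrm{UCB}_t(a) = \hat{\mu}_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}},
\qquad a_t = \arg\max_{a} \mathrm{UCB}_t(a)
```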
- Optimal Decision Tree Policies for Markov Decision Processes [7.995360025953931]
We study the optimization of size-limited decision trees, termed OMDTs, for Markov Decision Processes (MDPs).
We find that trees obtained by imitating a pre-trained policy can perform substantially worse than directly optimized ones, and we show that this is due to an inherent shortcoming of imitation learning, namely that complex policies cannot be represented using size-limited trees.
While there is generally a trade-off between the performance and interpretability of machine learning models, we find that OMDTs limited to a depth of 3 often perform close to the optimal limit.
arXiv Detail & Related papers (2023-01-30T18:51:02Z)
- Continuous MDP Homomorphisms and Homomorphic Policy Gradient [51.25171126424949]
We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces.
We propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously.
arXiv Detail & Related papers (2022-09-15T15:26:49Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
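As a reference point for the operators mentioned above, the classical robust Bellman operator under an $(s,a)$-rectangular uncertainty set $\mathcal{U}(s,a)$ is shown below; the paper derives concrete closed forms for the more general $s$-rectangular $L_p$ case, which this generic form does not capture.
```latex
% Classical (s,a)-rectangular robust Bellman operator (illustration);
% the paper treats the more general s-rectangular L_p-ball case.
(T v)(s) = \max_{a \in A} \min_{P \in \mathcal{U}(s,a)}
  \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') \Big]
```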
- Bridging the gap between QP-based and MPC-based RL [1.90365714903665]
We approximate the policy and value functions using optimization problems taking the form of Quadratic Programs (QPs).
A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy.
We illustrate the workings of our proposed method with the resulting structure using a point-mass task.
arXiv Detail & Related papers (2022-05-18T10:41:18Z)
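To illustrate what a QP-structured policy can look like, here is a minimal sketch using cvxpy. The parameterization (H, F, G, h) and the point-mass numbers are hypothetical stand-ins for the learnable QP structure the paper describes, not its actual scheme.
```python
import cvxpy as cp
import numpy as np

def qp_policy(state, H, F, G, h):
    """Toy QP-structured policy (hypothetical parameterization):
    u*(s) = argmin_u 0.5 * u'Hu + (F s)'u  subject to  G u <= h.
    H, F, G, h play the role of learnable structure in this sketch.
    """
    u = cp.Variable(H.shape[0])
    objective = 0.5 * cp.quad_form(u, H) + (F @ state) @ u
    cp.Problem(cp.Minimize(objective), [G @ u <= h]).solve()
    return u.value

# Example: point-mass-like setup with state (position, velocity)
# and a scalar action constrained to |u| <= 1. Numbers are made up.
H = np.array([[1.0]])            # positive definite => convex QP
F = np.array([[0.5, 1.0]])       # state-to-action coupling
G = np.array([[1.0], [-1.0]])    # encodes |u| <= 1 together with h
h = np.array([1.0, 1.0])
print(qp_policy(np.array([0.3, -0.1]), H, F, G, h))
```
Because the problem is a convex QP, the state-to-action mapping is well defined, and imposing an MPC-like structure on H, F, G, h is what promotes the explainability noted above.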
- Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
- Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of Early Terminated MDPs (ET-MDPs).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z)
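The early-termination idea can be sketched as a simple environment wrapper. The interface below (a Gym-style API with a per-step constraint cost reported in info) is an assumption for illustration, and the paper's context-model-based off-policy algorithm is not shown.
```python
class EarlyTerminationWrapper:
    """Illustrative ET-MDP-style wrapper (assumed Gym-like interface).
    The episode ends as soon as accumulated constraint cost exceeds a
    budget, so solving the wrapped, unconstrained MDP discourages the
    constraint violations of the original CMDP.
    """

    def __init__(self, env, cost_budget=0.0):
        self.env = env
        self.cost_budget = cost_budget
        self.total_cost = 0.0

    def reset(self):
        self.total_cost = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Assumes the base environment reports a per-step constraint
        # cost via info["cost"], as in common safe-RL benchmarks.
        self.total_cost += info.get("cost", 0.0)
        if self.total_cost > self.cost_budget:
            done = True  # terminate early instead of enforcing a constraint
        return obs, reward, done, info
```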
- Strengthening Deterministic Policies for POMDPs [5.092711491848192]
We provide a novel MILP encoding that supports sophisticated specifications in the form of temporal logic constraints.
We employ a preprocessing of the POMDP to encompass memory-based decisions.
The advantages of our approach lie (1) in the flexibility to strengthen simple deterministic policies without losing computational tractability and (2) in the ability to enforce the provable satisfaction of arbitrarily many specifications.
arXiv Detail & Related papers (2020-07-16T14:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.