Iterative Bounding MDPs: Learning Interpretable Policies via
Non-Interpretable Methods
- URL: http://arxiv.org/abs/2102.13045v1
- Date: Thu, 25 Feb 2021 17:55:15 GMT
- Title: Iterative Bounding MDPs: Learning Interpretable Policies via
Non-Interpretable Methods
- Authors: Nicholay Topin, Stephanie Milani, Fei Fang, Manuela Veloso
- Abstract summary: Current work in explainable reinforcement learning generally produces policies in the form of a decision tree over the state space.
We propose a novel Markov Decision Process (MDP) type for learning decision tree policies: Iterative Bounding MDPs (IBMDPs).
IBMDPs are constructed around a base MDP so each IBMDP policy is guaranteed to correspond to a decision tree policy for the base MDP when using a method-agnostic masking procedure.
- Score: 37.553463240629135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current work in explainable reinforcement learning generally produces
policies in the form of a decision tree over the state space. Such policies can
be used for formal safety verification, agent behavior prediction, and manual
inspection of important features. However, existing approaches fit a decision
tree after training or use a custom learning procedure which is not compatible
with new learning techniques, such as those which use neural networks. To
address this limitation, we propose a novel Markov Decision Process (MDP) type
for learning decision tree policies: Iterative Bounding MDPs (IBMDPs). An IBMDP
is constructed around a base MDP so each IBMDP policy is guaranteed to
correspond to a decision tree policy for the base MDP when using a
method-agnostic masking procedure. Because of this decision tree equivalence,
any function approximator can be used during training, including a neural
network, while yielding a decision tree policy for the base MDP. We present the
required masking procedure as well as a modified value update step which allows
IBMDPs to be solved using existing algorithms. We apply this procedure to
produce IBMDP variants of recent reinforcement learning methods. We empirically
show the benefits of our approach by solving IBMDPs to produce decision tree
policies for the base MDPs.
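To make the construction concrete, below is a minimal, illustrative sketch of an IBMDP-style wrapper. The interface, the normalized-feature assumption, and the per-query penalty are assumptions for illustration; the paper's exact formulation (including its modified value update step) is not reproduced here.
```python
import numpy as np

class IterativeBoundingMDP:
    """Minimal, illustrative IBMDP-style wrapper (assumed interface;
    not the authors' implementation). The agent alternates between
    information-gathering actions, which refine per-feature bounds on
    the current base state, and base actions, which act in the base MDP.
    """

    def __init__(self, base_env, n_features, info_penalty=-0.01):
        self.base_env = base_env          # base MDP with reset()/step()
        self.n_features = n_features      # features assumed normalized to [0, 1]
        self.info_penalty = info_penalty  # small cost per bounding query
        self.state = None

    def reset(self):
        self.state = self.base_env.reset()
        self._reset_bounds()
        return self._masked_obs()

    def _reset_bounds(self):
        self.lower = np.zeros(self.n_features)
        self.upper = np.ones(self.n_features)

    def _masked_obs(self):
        # Masking: the policy observes only the current bounds, never the
        # raw base state, so its choices depend on the bounds alone and
        # can be read off as a decision tree over the base features.
        return np.concatenate([self.lower, self.upper])

    def step(self, action):
        if isinstance(action, tuple):     # info action: (feature, threshold)
            feature, threshold = action
            if self.state[feature] >= threshold:
                self.lower[feature] = max(self.lower[feature], threshold)
            else:
                self.upper[feature] = min(self.upper[feature], threshold)
            return self._masked_obs(), self.info_penalty, False, {}
        # Base action: act in the underlying MDP and restart the bounding.
        self.state, reward, done, info = self.base_env.step(action)
        self._reset_bounds()
        return self._masked_obs(), reward, done, info
```
Under this masking, an IBMDP policy traces a root-to-leaf path for every base state: each information-gathering action corresponds to an internal split node and each base action to a leaf, which is the decision tree equivalence the abstract describes.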
Related papers
- MDP Geometry, Normalization and Value Free Solvers [15.627546283580166]
We present a new geometric interpretation of MDPs, which is useful for analyzing the dynamics of the main MDP algorithms.
We show that MDPs can be split into equivalence classes with indistinguishable algorithm dynamics.
arXiv Detail & Related papers (2024-07-09T09:39:45Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
A sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a last-iterate guarantee of a near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
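For intuition only, a generic UCB-style rule is sketched below. The paper's actual bonus term is PSR-specific and upper bounds the total variation distance between the estimated and true models, so this standard bandit form is a stand-in rather than the paper's construction.
```latex
% Generic UCB1-style rule for illustration; the paper's PSR-specific
% bonus (bounding a total variation distance) is more involved.
\mathrm{UCB}_t(a) = \hat{\mu}_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}},
\qquad a_t = \arg\max_{a} \mathrm{UCB}_t(a)
```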
- Optimal Decision Tree Policies for Markov Decision Processes [7.995360025953931]
We study the optimization of size-limited decision trees, termed OMDTs, for Markov Decision Processes (MDPs).
We find that trees obtained by imitating a pre-trained policy can perform substantially worse than directly optimized ones, and we show that this is due to an inherent shortcoming of imitation learning, namely that complex policies cannot be represented using size-limited trees.
While there is generally a trade-off between the performance and interpretability of machine learning models, we find that OMDTs limited to a depth of 3 often perform close to the optimal limit.
arXiv Detail & Related papers (2023-01-30T18:51:02Z)
- Continuous MDP Homomorphisms and Homomorphic Policy Gradient [51.25171126424949]
We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces.
We propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously.
arXiv Detail & Related papers (2022-09-15T15:26:49Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
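As a reference point for the operators mentioned above, the classical robust Bellman operator under an $(s,a)$-rectangular uncertainty set $\mathcal{U}(s,a)$ is shown below; the paper derives concrete closed forms for the more general $s$-rectangular $L_p$ case, which this generic form does not capture.
```latex
% Classical (s,a)-rectangular robust Bellman operator (illustration);
% the paper treats the more general s-rectangular L_p-ball case.
(T v)(s) = \max_{a \in A} \min_{P \in \mathcal{U}(s,a)}
  \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') \Big]
```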
- Bridging the gap between QP-based and MPC-based RL [1.90365714903665]
We approximate the policy and value functions using optimization problems taking the form of Quadratic Programs (QPs).
A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy.
We illustrate the workings of our proposed method with the resulting structure using a point-mass task.
arXiv Detail & Related papers (2022-05-18T10:41:18Z)
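To illustrate what a QP-structured policy can look like, here is a minimal sketch using cvxpy. The parameterization (H, F, G, h) and the point-mass numbers are hypothetical stand-ins for the learnable QP structure the paper describes, not its actual scheme.
```python
import cvxpy as cp
import numpy as np

def qp_policy(state, H, F, G, h):
    """Toy QP-structured policy (hypothetical parameterization):
    u*(s) = argmin_u 0.5 * u'Hu + (F s)'u  subject to  G u <= h.
    H, F, G, h play the role of learnable structure in this sketch.
    """
    u = cp.Variable(H.shape[0])
    objective = 0.5 * cp.quad_form(u, H) + (F @ state) @ u
    cp.Problem(cp.Minimize(objective), [G @ u <= h]).solve()
    return u.value

# Example: point-mass-like setup with state (position, velocity)
# and a scalar action constrained to |u| <= 1. Numbers are made up.
H = np.array([[1.0]])            # positive definite => convex QP
F = np.array([[0.5, 1.0]])       # state-to-action coupling
G = np.array([[1.0], [-1.0]])    # encodes |u| <= 1 together with h
h = np.array([1.0, 1.0])
print(qp_policy(np.array([0.3, -0.1]), H, F, G, h))
```
Because the problem is a convex QP, the state-to-action mapping is well defined, and imposing an MPC-like structure on H, F, G, h is what promotes the explainability noted above.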
- Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
- Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of Early Terminated MDPs (ET-MDPs).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z)
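The early-termination idea can be sketched as a simple environment wrapper. The interface below (a Gym-style API with a per-step constraint cost reported in info) is an assumption for illustration, and the paper's context-model-based off-policy algorithm is not shown.
```python
class EarlyTerminationWrapper:
    """Illustrative ET-MDP-style wrapper (assumed Gym-like interface).
    The episode ends as soon as accumulated constraint cost exceeds a
    budget, so solving the wrapped, unconstrained MDP discourages the
    constraint violations of the original CMDP.
    """

    def __init__(self, env, cost_budget=0.0):
        self.env = env
        self.cost_budget = cost_budget
        self.total_cost = 0.0

    def reset(self):
        self.total_cost = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Assumes the base environment reports a per-step constraint
        # cost via info["cost"], as in common safe-RL benchmarks.
        self.total_cost += info.get("cost", 0.0)
        if self.total_cost > self.cost_budget:
            done = True  # terminate early instead of enforcing a constraint
        return obs, reward, done, info
```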
- Strengthening Deterministic Policies for POMDPs [5.092711491848192]
We provide a novel MILP encoding that supports sophisticated specifications in the form of temporal logic constraints.
We employ a preprocessing of the POMDP to encompass memory-based decisions.
The advantages of our approach lie (1) in the flexibility to strengthen simple deterministic policies without losing computational tractability and (2) in the ability to enforce the provable satisfaction of arbitrarily many specifications.
arXiv Detail & Related papers (2020-07-16T14:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.