Verifiable Planning in Expected Reward Multichain MDPs
- URL: http://arxiv.org/abs/2012.02178v1
- Date: Thu, 3 Dec 2020 18:54:24 GMT
- Title: Verifiable Planning in Expected Reward Multichain MDPs
- Authors: George K. Atia, Andre Beckus, Ismail Alkhouri, Alvaro Velasquez
- Abstract summary: We explore the steady-state planning problem of deriving a decision-making policy for an agent.
We prove that optimal solutions to the proposed programs yield stationary policies with rigorous guarantees of behavior.
- Score: 20.456052208569115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The planning domain has experienced increased interest in the formal
synthesis of decision-making policies. This formal synthesis typically entails
finding a policy which satisfies formal specifications in the form of some
well-defined logic, such as Linear Temporal Logic (LTL) or Computation Tree
Logic (CTL), among others. While such logics are very powerful and expressive
in their capacity to capture desirable agent behavior, their value is limited
when deriving decision-making policies which satisfy certain types of
asymptotic behavior. In particular, we are interested in specifying constraints
on the steady-state behavior of an agent, which captures the proportion of time
an agent spends in each state as it interacts for an indefinite period of time
with its environment. This is sometimes called the average or expected behavior
of the agent. In this paper, we explore the steady-state planning problem of
deriving a decision-making policy for an agent such that constraints on its
steady-state behavior are satisfied. A linear programming solution for the
general case of multichain Markov Decision Processes (MDPs) is proposed and we
prove that optimal solutions to the proposed programs yield stationary policies
with rigorous guarantees of behavior.
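For intuition, the sketch below sets up the standard occupancy-measure linear program for the simpler unichain (single recurrent class) case, adds steady-state bounds as linear constraints, and recovers a stationary policy from the optimal occupancy measure. It is only an illustration under assumed names (the toy MDP and the arrays P, r, low, high are hypothetical), solved with scipy.optimize.linprog; it is not the paper's full multichain formulation, which requires additional variables to handle multiple recurrent classes.
```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (P, r, low, high are illustrative, not from the paper):
# P[s, a, s'] transition probabilities, r[s, a] rewards,
# and steady-state bounds low[s] <= sum_a x[s, a] <= high[s].
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # each P[s, a, :] sums to 1
r = rng.uniform(size=(S, A))
low, high = np.full(S, 0.05), np.full(S, 0.90)  # illustrative bounds; loosen if infeasible

n = S * A  # decision variables: occupancy measure x[s, a], flattened row-major

# Objective: maximize expected average reward  <=>  minimize -r . x
c = -r.reshape(n)

# Balance constraints: sum_a x[s', a] = sum_{s, a} P[s, a, s'] x[s, a] for every s',
# plus normalization sum_{s, a} x[s, a] = 1.
A_eq = np.zeros((S + 1, n))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = float(s == sp) - P[s, a, sp]
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

# Steady-state constraints on the state marginals m[s] = sum_a x[s, a]:
#   m[s] <= high[s]   and   -m[s] <= -low[s]
A_ub = np.zeros((2 * S, n))
b_ub = np.concatenate([high, -low])
for s in range(S):
    A_ub[s, s * A:(s + 1) * A] = 1.0
    A_ub[S + s, s * A:(s + 1) * A] = -1.0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n, method="highs")
assert res.success, res.message

# Recover a stationary policy pi(a | s) = x[s, a] / sum_a' x[s, a']
# (uniform on states the occupancy measure never visits).
x = res.x.reshape(S, A)
m = x.sum(axis=1, keepdims=True)
pi = np.where(m > 1e-12, x / np.clip(m, 1e-12, None), 1.0 / A)
print("steady-state distribution:", x.sum(axis=1))
print("stationary policy:\n", pi)
```
In this sketch the state marginals of the occupancy measure play the role of the steady-state distribution, so the bounds low and high are exactly constraints on the proportion of time spent in each state; the policy-extraction rule at the end is the usual normalization of the occupancy measure, whose guarantees in the general multichain setting are what the paper establishes.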
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Recursively-Constrained Partially Observable Markov Decision Processes [13.8724466775267]
We show that C-POMDPs violate the optimal substructure property over successive decision steps.
Online re-planning in C-POMDPs is often ineffective due to the inconsistency resulting from this violation.
We introduce the Recursively-Constrained POMDP, which imposes additional history-dependent cost constraints on the C-POMDP.
arXiv Detail & Related papers (2023-10-15T00:25:07Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Risk-Averse Decision Making Under Uncertainty [18.467950783426947]
A large class of decision making under uncertainty problems can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs).
In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures.
arXiv Detail & Related papers (2021-09-09T07:52:35Z) - LTL-Constrained Steady-State Policy Synthesis [0.0]
We study Markov decision processes (MDPs) with specifications that combine multiple constraint types, including LTL and steady-state constraints.
We provide a unified solution reducing the multi-type specification to a multi-dimensional long-run average reward.
The algorithm also extends to general $\omega$-regular properties and runs in time polynomial in the sizes of the MDP as well as the LDBA.
arXiv Detail & Related papers (2021-05-31T11:35:42Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - Identification of Unexpected Decisions in Partially Observable
Monte-Carlo Planning: a Rule-Based Approach [78.05638156687343]
We propose a methodology for analyzing POMCP policies by inspecting their traces.
The proposed method explores local properties of policy behavior to identify unexpected decisions.
We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to mobile robot navigation.
arXiv Detail & Related papers (2020-12-23T15:09:28Z) - Strengthening Deterministic Policies for POMDPs [5.092711491848192]
We provide a novel MILP encoding that supports sophisticated specifications in the form of temporal logic constraints.
We employ a preprocessing of the POMDP to encompass memory-based decisions.
The advantages of our approach lie (1) in the flexibility to strengthen simple deterministic policies without losing computational tractability and (2) in the ability to enforce the provable satisfaction of arbitrarily many specifications.
arXiv Detail & Related papers (2020-07-16T14:22:55Z) - Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning [0.0]
We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions.
In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order.
arXiv Detail & Related papers (2020-05-04T16:34:24Z)