Minimax Regret Optimisation for Robust Planning in Uncertain Markov
Decision Processes
- URL: http://arxiv.org/abs/2012.04626v1
- Date: Tue, 8 Dec 2020 18:48:14 GMT
- Title: Minimax Regret Optimisation for Robust Planning in Uncertain Markov
Decision Processes
- Authors: Marc Rigter, Bruno Lacerda, Nick Hawes
- Abstract summary: Minimax regret has been proposed as an objective for planning in Uncertain MDPs to find robust policies.
We introduce a Bellman equation to compute the regret for a policy.
We propose a dynamic programming algorithm based on this equation and show that it optimises minimax regret exactly for UMDPs with independent uncertainties.
- Score: 3.5289688061934963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The parameters for a Markov Decision Process (MDP) often cannot be specified
exactly. Uncertain MDPs (UMDPs) capture this model ambiguity by defining sets
which the parameters belong to. Minimax regret has been proposed as an
objective for planning in UMDPs to find robust policies which are not overly
conservative. In this work, we focus on planning for Stochastic Shortest Path
(SSP) UMDPs with uncertain cost and transition functions. We introduce a
Bellman equation to compute the regret for a policy. We propose a dynamic
programming algorithm that utilises the regret Bellman equation, and show that
it optimises minimax regret exactly for UMDPs with independent uncertainties.
For coupled uncertainties, we extend our approach to use options to enable a
trade-off between computation and solution quality. We evaluate our approach on
both synthetic and real-world domains, showing that it significantly
outperforms existing baselines.
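To make the objective concrete, the sketch below illustrates minimax regret by brute force on a toy finite-horizon cost MDP with a small set of cost scenarios. It is not the paper's regret Bellman equation or dynamic programming algorithm; the transitions, scenario costs, horizon, and the restriction to stationary deterministic policies are illustrative assumptions. The regret of a policy under a scenario is its expected cost minus that scenario's optimal cost, and the minimax-regret policy minimises the worst-case regret over scenarios.

```python
# Minimal brute-force sketch of the minimax regret objective for a toy
# finite-horizon cost MDP with a finite set of cost scenarios.
# NOT the paper's regret Bellman equation or DP algorithm; all numbers
# (transitions, costs, horizon) below are hypothetical.
import itertools
import numpy as np

n_states, n_actions, horizon = 2, 2, 3
# P[s, a, s']: known transition probabilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# Uncertain cost function: a small set of scenarios, each a cost matrix C[s, a].
scenarios = [np.array([[1.0, 2.0], [0.5, 3.0]]),
             np.array([[2.0, 1.0], [3.0, 0.5]])]

def policy_cost(pi, C):
    """Expected total cost of a stationary deterministic policy pi (one action
    per state) from state 0, by backward induction over the horizon."""
    V = np.zeros(n_states)
    for _ in range(horizon):
        V = np.array([C[s, pi[s]] + P[s, pi[s]] @ V for s in range(n_states)])
    return V[0]

def optimal_cost(C):
    """Optimal expected total cost from state 0 when the cost matrix C is known."""
    V = np.zeros(n_states)
    for _ in range(horizon):
        V = np.array([min(C[s, a] + P[s, a] @ V for a in range(n_actions))
                      for s in range(n_states)])
    return V[0]

opt = [optimal_cost(C) for C in scenarios]
policies = list(itertools.product(range(n_actions), repeat=n_states))
# Regret of pi under scenario C = cost of pi minus that scenario's optimal cost;
# the minimax-regret policy minimises the worst regret across scenarios.
worst_regret = {pi: max(policy_cost(pi, C) - o for C, o in zip(scenarios, opt))
                for pi in policies}
best_pi = min(worst_regret, key=worst_regret.get)
print("minimax-regret policy:", best_pi, "worst-case regret:", worst_regret[best_pi])
```

This enumeration scales exponentially in the number of states and scenarios; the paper's contribution is precisely to avoid such enumeration via a regret Bellman equation and dynamic programming.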
Related papers
- Anytime-Constrained Reinforcement Learning [6.981971551979697]
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints.
We show that there exist optimal deterministic policies augmented with cumulative costs.
We show that computing non-trivial approximately optimal policies is NP-hard in general.
arXiv Detail & Related papers (2023-11-09T16:51:26Z) - Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms.
arXiv Detail & Related papers (2023-06-05T03:57:16Z) - Policy Gradient Algorithms for Robust MDPs with Non-Rectangular
Uncertainty Sets [10.26382228865201]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets.
The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable.
We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z) - Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both
Worlds in Stochastic and Deterministic Environments [48.96971760679639]
We study variance-dependent regret bounds for Markov decision processes (MDPs).
We propose two new environment norms to characterize the fine-grained variance properties of the environment.
For model-based methods, we design a variant of the MVP algorithm.
In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs.
arXiv Detail & Related papers (2023-01-31T06:54:06Z) - Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds
for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z) - Adaptive Belief Discretization for POMDP Planning [7.508023795800546]
Many POMDP solvers uniformly discretize the belief space and give the planning error in terms of the (typically unknown) covering number.
We propose an adaptive belief discretization scheme, and give its associated planning error.
We demonstrate that our algorithm is highly competitive with the state of the art in a variety of scenarios.
arXiv Detail & Related papers (2021-04-15T07:04:32Z) - Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z) - Exploiting Submodular Value Functions For Scaling Up Active Perception [60.81276437097671]
In active perception tasks, an agent aims to select sensory actions that reduce uncertainty about one or more hidden variables.
Partially observable Markov decision processes (POMDPs) provide a natural model for such problems.
As the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially.
arXiv Detail & Related papers (2020-09-21T09:11:36Z) - Efficient Planning in Large MDPs with Weak Linear Function Approximation [4.56877715768796]
Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP.
We consider the planning problem in MDPs using linear value function approximation with only weak requirements.
arXiv Detail & Related papers (2020-07-13T04:40:41Z)