Solving Multi-Model MDPs by Coordinate Ascent and Dynamic Programming
- URL: http://arxiv.org/abs/2407.06329v1
- Date: Mon, 8 Jul 2024 18:47:59 GMT
- Title: Solving Multi-Model MDPs by Coordinate Ascent and Dynamic Programming
- Authors: Xihong Su, Marek Petrik
- Abstract summary: The multi-model Markov decision process (MMDP) is a promising framework for computing policies that are robust to parameter uncertainty in MDPs.
MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models.
We propose CADP, which combines a coordinate ascent method and a dynamic programming algorithm for solving MMDPs.
- Score: 8.495921422521068
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multi-model Markov decision process (MMDP) is a promising framework for computing policies that are robust to parameter uncertainty in MDPs. MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models. Because MMDPs are NP-hard to solve, most methods resort to approximations. In this paper, we derive the policy gradient of MMDPs and propose CADP, which combines a coordinate ascent method and a dynamic programming algorithm for solving MMDPs. The main innovation of CADP compared with earlier algorithms is to take the coordinate ascent perspective to adjust model weights iteratively to guarantee monotone policy improvements to a local maximum. A theoretical analysis of CADP proves that it never performs worse than previous dynamic programming algorithms like WSU. Our numerical results indicate that CADP substantially outperforms existing methods on several benchmark problems.
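To make the high-level description above concrete, the sketch below implements a weighted dynamic-programming sweep over a tabular, finite-horizon MMDP: per-model value functions and state occupancies are recomputed under the current non-stationary policy, and each time step (coordinate) is then improved greedily under model weights proportional to the prior times the occupancy. The function name, the occupancy-proportional weight rule, the uniform initial distribution, and the stopping test are illustrative assumptions; this is a sketch in the spirit of the abstract, not the paper's exact CADP (or WSU) procedure.

```python
import numpy as np

def solve_mmdp_sketch(P, R, lam, horizon, mu0=None, n_sweeps=50):
    """Illustrative weighted-DP sweeps for a finite-horizon tabular MMDP.

    P:   (M, S, A, S) transition probabilities, one MDP model per leading index
    R:   (M, S, A)    rewards per model
    lam: (M,)         prior weight of each model
    Returns a deterministic non-stationary policy of shape (horizon, S).
    """
    P, R, lam = np.asarray(P), np.asarray(R), np.asarray(lam, dtype=float)
    n_models, n_states, n_actions, _ = P.shape
    mu0 = np.full(n_states, 1.0 / n_states) if mu0 is None else np.asarray(mu0)
    idx = np.arange(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)

    for _ in range(n_sweeps):
        # 1) Evaluate the current policy in every model (backward recursion).
        V = np.zeros((n_models, horizon + 1, n_states))
        for t in range(horizon - 1, -1, -1):
            a = policy[t]
            for m in range(n_models):
                V[m, t] = R[m, idx, a] + P[m, idx, a] @ V[m, t + 1]

        # 2) State occupancies of the current policy in every model (forward pass).
        d = np.zeros((n_models, horizon, n_states))
        d[:, 0] = mu0
        for t in range(horizon - 1):
            a = policy[t]
            for m in range(n_models):
                d[m, t + 1] = d[m, t] @ P[m, idx, a]

        # 3) Improve each time step (coordinate) greedily under model weights
        #    proportional to prior * occupancy; the exact weight-update rule of
        #    CADP is not reproduced here.
        changed = False
        for t in range(horizon):
            w = lam[:, None] * d[:, t]                          # (M, S)
            Q = R + np.einsum("msap,mp->msa", P, V[:, t + 1])   # (M, S, A)
            new_a = np.argmax(np.einsum("ms,msa->sa", w, Q), axis=1)
            if np.any(new_a != policy[t]):
                policy[t] = new_a
                changed = True
        if not changed:  # no coordinate changed: a (local) fixed point
            break
    return policy
```

On a toy instance with two or three transition models that disagree in only a few states, such a sweep typically converges after a handful of outer iterations; the point of the example is only to show where the model weights enter the Bellman backup, not to reproduce the paper's guarantees.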
Related papers
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are standard metrics for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
arXiv Detail & Related papers (2024-10-31T16:53:20Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - MDP Geometry, Normalization and Value Free Solvers [15.627546283580166]
The Markov Decision Process (MDP) is a widely used mathematical model for sequential decision-making problems.
We show that MDPs can be divided into equivalence classes on which the dynamics of key solving algorithms are indistinguishable.
arXiv Detail & Related papers (2024-07-09T09:39:45Z) - Domain-Independent Dynamic Programming [5.449167190254984]
Domain-independent dynamic programming (DIDP) is a new model-based paradigm based on dynamic programming (DP).
We introduce Dynamic Programming Description Language (DyPDL), a formalism to define DP models based on a state transition system, inspired by AI planning.
We show that search algorithms can be used to solve DyPDL models and propose seven DIDP solvers.
arXiv Detail & Related papers (2024-01-25T01:48:09Z) - Non-stationary Reinforcement Learning under General Function Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
arXiv Detail & Related papers (2023-06-01T16:19:37Z) - Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that these algorithms use.
arXiv Detail & Related papers (2022-10-10T21:11:55Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for topologically constrained MDPs (TMDPs), obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Twice regularized MDPs and the equivalence between robustness and regularization [65.58188361659073]
We show that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs.
We generalize regularized MDPs to twice regularized MDPs.
arXiv Detail & Related papers (2021-10-12T18:33:45Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - A Relation Analysis of Markov Decision Process Frameworks [26.308541799686505]
We study the relation between different Markov Decision Process (MDP) frameworks in the machine learning and econometrics literature.
We show that the entropy-regularized MDP is equivalent to an MDP model and is strictly subsumed by the general regularized MDP.
arXiv Detail & Related papers (2020-08-18T09:27:26Z)