On the convex formulations of robust Markov decision processes
- URL: http://arxiv.org/abs/2209.10187v2
- Date: Wed, 13 Dec 2023 14:40:03 GMT
- Title: On the convex formulations of robust Markov decision processes
- Authors: Julien Grand-Clément, Marek Petrik
- Abstract summary: There is no known analog of the MDP convex optimization formulation for solving RMDPs.
We derive a convex formulation with a number of variables and constraints polynomial in the number of states and actions, but with large coefficients in the constraints.
- Score: 12.100620186370012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust Markov decision processes (RMDPs) are used for applications of dynamic
optimization in uncertain environments and have been studied extensively. Many
of the main properties and algorithms of MDPs, such as value iteration and
policy iteration, extend directly to RMDPs. Surprisingly, there is no known
analog of the MDP convex optimization formulation for solving RMDPs. This work
describes the first convex optimization formulation of RMDPs under the
classical sa-rectangularity and s-rectangularity assumptions. By using entropic
regularization and exponential change of variables, we derive a convex
formulation with a number of variables and constraints polynomial in the number
of states and actions, but with large coefficients in the constraints. We
further simplify the formulation for RMDPs with polyhedral, ellipsoidal, or
entropy-based uncertainty sets, showing that, in these cases, RMDPs can be
reformulated as conic programs based on exponential cones, quadratic cones, and
non-negative orthants. Our work opens a new research direction for RMDPs and
can serve as a first step toward obtaining a tractable convex formulation of
RMDPs.
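For context, the classical (non-robust) discounted MDP admits the well-known linear programming formulation whose robust analog the paper constructs; this baseline is standard, and the remark after it is only a sketch of the difficulty, not the paper's derivation:
\[
\min_{v \in \mathbb{R}^{\mathcal{S}}} \; \sum_{s \in \mathcal{S}} \mu(s)\, v(s)
\quad \text{s.t.} \quad
v(s) \;\ge\; r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v(s') \quad \forall s \in \mathcal{S},\; a \in \mathcal{A},
\]
where $\mu$ is an initial state distribution and $\gamma$ the discount factor. Under sa-rectangularity the right-hand side becomes $r(s,a) + \gamma \min_{p \in \mathcal{P}_{sa}} \sum_{s'} p(s')\, v(s')$; the minimum of linear functions is concave, so the feasible region is no longer convex in $v$, which is the obstacle that the entropic regularization and exponential change of variables are designed to circumvent.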
Related papers
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are standard metrics for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
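As a reminder (a standard definition, not specific to this paper), the Value-at-Risk at level $\alpha$ of a return random variable $X$ is its $\alpha$-quantile:
\[
\mathrm{VaR}_\alpha(X) \;=\; \inf\{\, x \in \mathbb{R} : \Pr(X \le x) \ge \alpha \,\},
\]
so quantile optimization in an MDP seeks a policy maximizing this quantile of the cumulative reward rather than its expectation.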
arXiv Detail & Related papers (2024-10-31T16:53:20Z)
- Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
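A schematic way to contrast the two models (notation ours, a hedged reconstruction from the abstract): with context $c$, feature map $\phi$, and weight vector $\theta$,
\[
\text{Model I:}\quad Q(s,a;c) = \phi_c(s,a)^\top \theta,
\qquad
\text{Model II:}\quad Q(s,a;c) = \phi(s,a)^\top \theta_c,
\]
i.e., Model I varies the representation with the context while sharing the weights, and Model II shares the representation while varying the weights.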
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
- Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version of Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
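For reference, the standard low-rank MDP assumption (here written with a time index to reflect nonstationarity; the paper's exact formulation may differ) factorizes the transition kernel through $d$-dimensional embeddings:
\[
P_t(s' \mid s,a) \;=\; \phi_t(s,a)^\top \mu_t(s'),
\]
with both factors allowed to drift over time, which is precisely what makes the problem nonstationary.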
arXiv Detail & Related papers (2023-08-10T09:52:44Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a guaranteed near-optimal last-iterate policy, and guaranteed model accuracy.
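The total variation distance bounded by the bonus term is the standard one: for distributions $P$ and $Q$ on the same finite space,
\[
\mathrm{TV}(P, Q) \;=\; \tfrac{1}{2} \sum_{x} \bigl| P(x) - Q(x) \bigr|,
\]
so the UCB-style bonus inflates estimated values by a quantity guaranteed to dominate the model estimation error in this metric.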
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- Non-stationary Reinforcement Learning under General Function Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
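The variation budget referenced here is, in the usual nonstationary-RL convention (a standard notion; the paper's precise metric may differ), the cumulative drift of the environment over the horizon $T$, e.g.
\[
\Delta \;=\; \sum_{t=1}^{T-1} \Bigl( \bigl\| P_{t+1} - P_t \bigr\| + \bigl\| r_{t+1} - r_t \bigr\| \Bigr),
\]
and the guarantee is that performance degrades gracefully as long as $\Delta$ grows slowly relative to $T$.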
arXiv Detail & Related papers (2023-06-01T16:19:37Z)
- First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
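Under $(\mathbf{s},\mathbf{a})$-rectangularity the robust objective is governed by the standard robust Bellman operator (a standard fact, not specific to this paper):
\[
(\mathcal{T} v)(s) \;=\; \max_{a \in \mathcal{A}} \; \min_{p \in \mathcal{P}_{sa}} \Bigl[\, r(s,a) + \gamma \sum_{s'} p(s')\, v(s') \,\Bigr],
\]
and structural properties of this inner min-max are the natural target for first-order policy optimization methods.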
arXiv Detail & Related papers (2022-09-21T18:10:28Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
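The PPO objective being extended is the standard clipped surrogate (standard background, not the paper's contribution): with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$,
\[
L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\Bigl[ \min\bigl( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \bigr) \Bigr],
\]
which the TMDP extension adapts to handle an ordering over multiple objectives.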
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Robust Phi-Divergence MDPs [13.555107578858307]
We develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets.
We show that the associated s-rectangular robust MDPs can be solved substantially faster than with state-of-the-art commercial solvers.
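A phi-divergence ambiguity set around a nominal distribution $q$ takes the standard form
\[
\mathcal{P} \;=\; \Bigl\{\, p \in \Delta \,:\, \sum_{i} q_i\, \phi\!\bigl( p_i / q_i \bigr) \le \kappa \,\Bigr\},
\]
where $\phi$ is convex with $\phi(1) = 0$ (e.g., Kullback-Leibler or chi-squared) and $\kappa$ controls the radius; s-rectangularity means one such set is imposed per state rather than per state-action pair.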
arXiv Detail & Related papers (2022-05-27T19:08:55Z)
- Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
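A minimal sketch of what permutation invariance means for a mean-field critic (an illustrative form, not necessarily the paper's exact architecture): with $N$ agents and per-agent observations $o_1, \dots, o_N$,
\[
V(o_1, \dots, o_N) \;=\; g\Bigl( \tfrac{1}{N} \sum_{i=1}^{N} f(o_i) \Bigr),
\]
so the value is unchanged under any reordering of the agents, which is the inductive bias credited with the improved performance.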
arXiv Detail & Related papers (2021-05-18T04:35:41Z)
- CP-MDP: A CANDECOMP-PARAFAC Decomposition Approach to Solve a Markov Decision Process Multidimensional Problem [21.79259092920586]
We develop an MDP solver for a multidimensional problem using a tensor decomposition method.
We show that our approach can solve much larger problems using substantially less memory.
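The CANDECOMP/PARAFAC (CP) decomposition in question approximates a tensor, such as the multidimensional transition model, by a sum of $R$ rank-one terms (standard definition):
\[
\mathcal{T} \;\approx\; \sum_{r=1}^{R} \lambda_r \; \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\]
which stores $O(R(d_1 + d_2 + d_3))$ numbers instead of $O(d_1 d_2 d_3)$, explaining the memory savings on large problems.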
arXiv Detail & Related papers (2021-02-27T21:33:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.