Towards Painless Policy Optimization for Constrained MDPs
- URL: http://arxiv.org/abs/2204.05176v1
- Date: Mon, 11 Apr 2022 15:08:09 GMT
- Title: Towards Painless Policy Optimization for Constrained MDPs
- Authors: Arushi Jain, Sharan Vaswani, Reza Babanezhad, Csaba Szepesvari, Doina
Precup
- Abstract summary: We study policy optimization in an infinite horizon, $gamma$-discounted constrained Markov decision process (CMDP)
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
- Score: 46.12526917024248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study policy optimization in an infinite horizon, $\gamma$-discounted
constrained Markov decision process (CMDP). Our objective is to return a policy
that achieves large expected reward with a small constraint violation. We
consider the online setting with linear function approximation and assume
global access to the corresponding features. We propose a generic primal-dual
framework that allows us to bound the reward sub-optimality and constraint
violation for arbitrary algorithms in terms of their primal and dual regret on
online linear optimization problems. We instantiate this framework to use
coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm.
Assuming that the action-value functions are $\varepsilon_b$-close to the span
of the $d$-dimensional state-action features and no sampling errors, we prove
that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - \gamma)^3
\sqrt{T}} + \frac{\varepsilon_b\sqrt{d}}{(1 - \gamma)^2} \right)$ reward
sub-optimality and an $O\left(\frac{1}{(1 - \gamma)^2 \sqrt{T}} +
\frac{\varepsilon_b \sqrt{d}}{1 - \gamma} \right)$ constraint violation.
Importantly, unlike gradient descent-ascent and other recent methods, CBP does
not require extensive hyperparameter tuning. Via experiments on synthetic and
Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
Related papers
Err
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.