Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with
Non-stationary Objectives and Constraints
- URL: http://arxiv.org/abs/2201.11965v1
- Date: Fri, 28 Jan 2022 07:18:29 GMT
- Title: Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with
Non-stationary Objectives and Constraints
- Authors: Yuhao Ding and Javad Lavaei
- Abstract summary: We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints.
We propose a Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm that features three mechanisms: periodic-restart-based policy improvement, dual update with dual regularization, and periodic-restart-based optimistic policy evaluation.
- Score: 8.840221198764482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider primal-dual-based reinforcement learning (RL) in episodic
constrained Markov decision processes (CMDPs) with non-stationary objectives
and constraints, which play a central role in ensuring the safety of RL in
time-varying environments. In this problem, the reward/utility functions and
the state transition functions are both allowed to vary arbitrarily over time
as long as their cumulative variations do not exceed certain known variation
budgets. Designing safe RL algorithms in time-varying environments is
particularly challenging because of the need to integrate constraint violation
reduction, safe exploration, and adaptation to non-stationarity.
To this end, we propose a Periodically Restarted Optimistic Primal-Dual
Proximal Policy Optimization (PROPD-PPO) algorithm that features three
mechanisms: periodic-restart-based policy improvement, dual update with dual
regularization, and periodic-restart-based optimistic policy evaluation. We
establish a dynamic regret bound and a constraint violation bound for the
proposed algorithm in both the linear kernel CMDP function approximation
setting and the tabular CMDP setting. This paper provides the first provably
efficient algorithm for non-stationary CMDPs with safe exploration.
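As a rough illustration of how the three mechanisms fit together, a minimal tabular sketch is given below. It is not the authors' PROPD-PPO implementation: the toy CMDP, the restart period, the bonus scale, the step sizes, and the dual-regularization weight are all illustrative assumptions, and the updates are simplified so the loop stays self-contained.

```python
# Schematic sketch of a periodically restarted optimistic primal-dual loop on a
# toy tabular CMDP. NOT the paper's PROPD-PPO: all constants below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

S, A, H = 5, 3, 8                  # states, actions, horizon
K, W = 200, 40                     # episodes and restart period (assumption)
b = 0.5 * H                        # utility threshold: require estimated utility >= b
eta_dual, delta_reg = 0.05, 0.01   # dual step size and dual-regularization weight (assumptions)
alpha_pi, beta_bonus = 1.0, 0.1    # policy-update temperature and optimism scale (assumptions)

# Toy CMDP: transition kernel P, reward r, and utility (constraint) signal u.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(0.0, 1.0, size=(S, A))
u = rng.uniform(0.0, 1.0, size=(S, A))

lam = 0.0                                    # dual variable
policy = np.full((H, S, A), 1.0 / A)         # uniform initial policy
counts = np.ones((S, A))                     # visit counts driving the optimism bonus

def evaluate(policy, payoff, bonus):
    """Optimistic backward induction for a per-step payoff table."""
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q[h] = np.minimum(payoff + P @ V[h + 1] + bonus, H)   # optimism, clipped at H
        V[h] = (policy[h] * Q[h]).sum(axis=1)
    return Q, V

for k in range(K):
    if k % W == 0:                           # periodic restart: forget stale statistics
        policy = np.full((H, S, A), 1.0 / A)
        counts = np.ones((S, A))

    bonus = beta_bonus / np.sqrt(counts)
    Qr, _ = evaluate(policy, r, bonus)       # optimistic reward evaluation
    Qu, Vu = evaluate(policy, u, bonus)      # optimistic utility evaluation

    # PPO-style soft policy improvement on the Lagrangian Qr + lam * Qu.
    for h in range(H):
        logits = np.log(policy[h] + 1e-12) + alpha_pi * (Qr[h] + lam * Qu[h])
        policy[h] = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy[h] /= policy[h].sum(axis=1, keepdims=True)

    # Dual update with dual regularization: shrink lam, then step along the violation.
    est_utility = Vu[0].mean()
    lam = max(0.0, (1.0 - eta_dual * delta_reg) * lam + eta_dual * (b - est_utility))

    # Roll out one episode to refresh the visit counts used by the bonus.
    s = rng.integers(S)
    for h in range(H):
        a = rng.choice(A, p=policy[h, s])
        counts[s, a] += 1
        s = rng.choice(S, p=P[s, a])

print(f"final dual variable: {lam:.3f}, estimated utility: {est_utility:.2f} (threshold {b:.1f})")
```

The restart every W episodes is what handles non-stationarity in this sketch: once the environment drifts, stale counts and policies are discarded rather than averaged against fresh data.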
Related papers
- Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning [22.333460316347264]
We introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies.
We develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint.
arXiv Detail & Related papers (2025-02-07T09:30:35Z)
- Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization [10.465789490644031]
We propose a novel framework, the robust regularized Markov decision process ($d$-RRMDP).
For the offline RL setting, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI).
arXiv Detail & Related papers (2024-11-27T18:57:03Z)
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method (D-PGPD) to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this is the first work to propose a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality [55.06411438416805]
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications.
This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies.
It is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- Constrained Proximal Policy Optimization [36.20839673950677]
We propose a novel first-order feasible method named Constrained Proximal Policy Optimization (CPPO).
Our approach integrates the Expectation-Maximization framework, solving the constrained problem in two steps: 1) calculating the optimal policy distribution within the feasible region (E-step), and 2) conducting a first-order update to adjust the current policy towards the optimal policy obtained in the E-step (M-step).
Empirical evaluations conducted in complex and uncertain environments validate the effectiveness of our proposed method.
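The E-step/M-step scheme described in the CPPO summary above can be made concrete with a small tabular sketch. This is not the paper's CPPO update: the hard feasibility indicator, the temperature, and the mixing step below are illustrative assumptions standing in for the closed-form E-step and the first-order M-step.

```python
# Toy EM-style constrained policy update (illustrative, not the CPPO algorithm).
import numpy as np

rng = np.random.default_rng(1)

S, A = 4, 5
pi = np.full((S, A), 1.0 / A)             # current policy
Qr = rng.normal(size=(S, A))              # reward critic (assumed given)
Qc = rng.uniform(0.0, 1.0, size=(S, A))   # cost critic (assumed given)
d, temp, step = 0.6, 0.5, 0.3             # cost limit, temperature, M-step size (assumptions)

# E-step: target action distribution weighted by reward values and restricted
# (schematically) to actions whose estimated cost stays within the limit.
feasible = (Qc <= d).astype(float)
feasible[feasible.sum(axis=1) == 0] = 1.0   # fall back to all actions if none feasible
q = pi * feasible * np.exp(Qr / temp)
q /= q.sum(axis=1, keepdims=True)

# M-step: first-order move of the current policy toward the E-step target.
pi = (1.0 - step) * pi + step * q
pi /= pi.sum(axis=1, keepdims=True)
print(np.round(pi, 3))
```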
arXiv Detail & Related papers (2023-05-23T16:33:55Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
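The double-sampling issue mentioned above is the standard one for variance terms: the variance contains the square of an expectation, which cannot be estimated unbiasedly from single samples. The generic Fenchel-duality workaround (sketched here in its textbook form, which may differ in detail from the paper's exact formulation) is:

```latex
% Variance of a return-like quantity X under a distribution d:
%   Var_d(X) = E_d[X^2] - (E_d[X])^2,
% where the squared expectation is the problematic (double-sampling) term.
% Fenchel conjugacy of the square function, x^2 = max_nu (2*nu*x - nu^2), gives
\[
  \bigl(\mathbb{E}_d[X]\bigr)^2
    = \max_{\nu \in \mathbb{R}} \bigl( 2\,\nu\,\mathbb{E}_d[X] - \nu^2 \bigr),
  \qquad
  \operatorname{Var}_d(X)
    = \mathbb{E}_d[X^2]
      - \max_{\nu \in \mathbb{R}} \bigl( 2\,\nu\,\mathbb{E}_d[X] - \nu^2 \bigr),
\]
% so every expectation now enters linearly and admits single-sample gradient
% estimates; the inner maximum is attained at nu = E_d[X].
```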
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs)
We propose the Periodically Restarted Optimistic Policy Optimization (PROPO) algorithm.
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
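As a toy contrast between the two mechanisms named above (not the paper's estimator; the drift model, noise level, and window length are made up), the snippet below tracks a slowly drifting target value with a sliding-window average and with a periodically restarted running mean:

```python
# Sliding-window average vs. periodically restarted running mean on a drifting target.
import numpy as np

rng = np.random.default_rng(2)
T, W = 300, 30                                      # steps and window/restart period (assumptions)
true_v = 1.0 + np.cumsum(rng.normal(0.0, 0.05, T))  # slowly drifting target value
obs = true_v + rng.normal(0.0, 0.2, T)              # noisy observations seen online

sw_est, rs_est = np.empty(T), np.empty(T)
restart_sum, restart_n = 0.0, 0
for t in range(T):
    sw_est[t] = obs[max(0, t - W + 1): t + 1].mean()   # sliding-window estimate
    if t % W == 0:                                     # periodic restart
        restart_sum, restart_n = 0.0, 0
    restart_sum += obs[t]
    restart_n += 1
    rs_est[t] = restart_sum / restart_n

print(f"sliding-window MAE: {np.abs(sw_est - true_v).mean():.3f}")
print(f"restart MAE:        {np.abs(rs_est - true_v).mean():.3f}")
```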
arXiv Detail & Related papers (2021-10-18T02:33:20Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms establishing convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.