An Optimistic Algorithm for online CMDPS with Anytime Adversarial Constraints
- URL: http://arxiv.org/abs/2505.21841v1
- Date: Wed, 28 May 2025 00:16:34 GMT
- Title: An Optimistic Algorithm for online CMDPS with Anytime Adversarial Constraints
- Authors: Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, Honghao Wei
- Abstract summary: Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. We propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints.
- Score: 7.275101606364466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret O(sqrt(K)) and strong constraint violation O(sqrt(K)) without relying on Slater's condition or the existence of a strictly known safe policy. We also show that access to accurate estimates of rewards and transitions can further improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
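The abstract distinguishes "strong" constraint violation from the weaker notion often used in the CMDP literature. The sketch below illustrates the usual reading of that distinction, assuming that strong violation sums only the positive part of each episode's excess cost so that safe episodes cannot cancel unsafe ones. The function names, the scalar per-episode costs, and the fixed budget are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def weak_violation(costs: Sequence[float], budget: float) -> float:
    """Cumulative violation where safe episodes may cancel unsafe ones."""
    return max(0.0, sum(c - budget for c in costs))


def strong_violation(costs: Sequence[float], budget: float) -> float:
    """Cumulative violation where each episode's excess is clipped at zero first."""
    return sum(max(0.0, c - budget) for c in costs)


# Hypothetical per-episode costs alternating between safe and unsafe episodes.
episode_costs = [0.25, 0.75, 0.25, 0.75]
budget = 0.5

weak = weak_violation(episode_costs, budget)      # cancellation hides the unsafe episodes
strong = strong_violation(episode_costs, budget)  # every unsafe episode is counted
```

Under this reading, a policy sequence that alternates between overly cautious and unsafe episodes can have zero weak violation while its strong violation grows linearly, which is why a sublinear bound on the strong quantity is the more demanding guarantee.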
Related papers
- Flipping-based Policy for Chance-Constrained Markov Decision Processes [9.404184937255694]
This paper proposes a flipping-based policy for Chance-Constrained Markov Decision Processes (CCMDPs).
The flipping-based policy selects the next action by tossing a potentially distorted coin between two action candidates.
We demonstrate that the flipping-based policy can improve the performance of the existing safe RL algorithms under the same limits of safety constraints.
arXiv Detail & Related papers (2024-10-09T02:00:39Z)
- Learning Adversarial MDPs with Stochastic Hard Constraints [37.24692425018]
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and hard constraints. Our work is the first to study CMDPs involving both adversarial losses and hard constraints.
arXiv Detail & Related papers (2024-03-06T12:49:08Z)
- Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
- Learning Predictive Safety Filter via Decomposition of Robust Invariant Set [6.94348936509225]
This paper combines the advantages of both RMPC and RL to synthesize safety filters for nonlinear systems.
We propose a policy approach for robust reach problems and establish its complexity.
arXiv Detail & Related papers (2023-11-12T08:11:28Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Safe Online Bid Optimization with Return-On-Investment and Budget Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraint violations.
More interestingly, we provide an algorithm, namely GCB_safe(psi,phi), guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances psi and phi.
arXiv Detail & Related papers (2022-01-18T17:24:20Z)
- Deep Constrained Q-learning [15.582910645906145]
In many real world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a set of constraints.
We propose Constrained Q-learning, a novel off-policy reinforcement learning framework restricting the action space directly in the Q-update to learn the optimal Q-function for the induced constrained MDP and the corresponding safe policy.
arXiv Detail & Related papers (2020-03-20T17:26:03Z)
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum resulting in an optimal, yet physically feasible, robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)