Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints
- URL: http://arxiv.org/abs/2509.20114v2
- Date: Thu, 02 Oct 2025 08:29:31 GMT
- Title: Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints
- Authors: Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti,
- Abstract summary: We study emphonline episodic Constrained Markov Decision Processes (CMDPs) under both and adversarial constraints.<n>We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al.<n>In the adversarial regime, emphi.e., our algorithm ensures sublinear constraint violation without Slater's condition, and sublinear $alpha$-regret with respect to the emphunconstrained optimum.
- Score: 33.41566575424402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study \emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emph{positive} constraint violation, which does not allow to recover from large violation in the early episodes by playing strictly safe policies. In the adversarial regime, \emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $\alpha$-regret with respect to the \emph{unconstrained} optimum, where $\alpha$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
Related papers
- Truly Adapting to Adversarial Constraints in Constrained MABs [33.41566575424402]
We study the constrained variant of the emphmulti-armed bandit (MAB) problem.<n>We propose an algorithm attaining $widetildemathcalO(sqrtT+C)$ regret and $widetildemathcalO(sqrtT+C)$ positive violation.<n>We then show how to extend these guarantees when only bandit feedback is available for the losses.
arXiv Detail & Related papers (2026-02-16T08:07:11Z) - Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality [53.525547349715595]
We propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO)<n>RRPO operates directly on the primal problem without relying on dual formulations.<n>We show convergence to an approximately optimal feasible policy with complexity matching the best-known lower bound.
arXiv Detail & Related papers (2025-08-24T16:59:38Z) - Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds [28.4976864705409]
This paper studies constrained Markov decision processes (CMDPs) with constraints against thresholds, aiming at the safety of reinforcement learning in unknown and uncertain environments.<n>We leverage a GrowingWindow estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Pessimistic-Optimistic Thresholding (SPOT)<n>SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings.
arXiv Detail & Related papers (2025-04-07T11:58:19Z) - Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback [34.7178680288326]
Stradi et al.(2024) proposed the first best-of-both-worlds algorithm for constrained Markov decision processes.<n>In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback.<n>Our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
arXiv Detail & Related papers (2024-10-03T07:44:40Z) - Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints [34.7178680288326]
In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining sublinear regret and sublinear constraint violation.
We show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performances smoothly degrade as non-stationarity increases.
arXiv Detail & Related papers (2024-05-23T09:48:48Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Safe Online Bid Optimization with Return-On-Investment and Budget
Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraints violations.
More interestingly, we provide an algorithm, namely GCB_safe(psi,phi), guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances psi and phi.
arXiv Detail & Related papers (2022-01-18T17:24:20Z) - Misspecified Gaussian Process Bandit Optimization [59.30399661155574]
Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem.
We introduce a emphmisspecified kernelized bandit setting where the unknown function can be $epsilon$--uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS)
We show that our algorithm achieves optimal dependence on $epsilon$ with no prior knowledge of misspecification.
arXiv Detail & Related papers (2021-11-09T09:00:02Z) - Fast Global Convergence of Policy Optimization for Constrained MDPs [17.825031573375725]
We show that gradient-based methods can achieve an $mathcalO(log(T)/T)$ global convergence rate both for the optimality gap and the constraint violation.
When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$.
arXiv Detail & Related papers (2021-10-31T17:46:26Z) - On Lower Bounds for Standard and Robust Gaussian Process Bandit
Optimization [55.937424268654645]
We consider algorithm-independent lower bounds for the problem of black-box optimization of functions having a bounded norm.
We provide a novel proof technique for deriving lower bounds on the regret, with benefits including simplicity, versatility, and an improved dependence on the error probability.
arXiv Detail & Related papers (2020-08-20T03:48:14Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present an provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.