Value-at-Risk Constrained Policy Optimization
- URL: http://arxiv.org/abs/2601.22993v1
- Date: Fri, 30 Jan 2026 13:57:47 GMT
- Title: Value-at-Risk Constrained Policy Optimization
- Authors: Rohan Tangri, Jan-Peter Calliess
- Abstract summary: VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments. We employ the one-sided Chebyshev inequality to obtain a tractable surrogate based on the first two moments of the cost return.
- Score: 0.042970700836450486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample-efficient and conservative method designed to optimize Value-at-Risk (VaR) constraints directly. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ the one-sided Chebyshev inequality to obtain a tractable surrogate based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide rigorous worst-case bounds for both policy improvement and constraint violation during the training process.
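The surrogate construction can be made concrete. The one-sided Chebyshev (Cantelli) inequality states P(C - mu >= t) <= sigma^2 / (sigma^2 + t^2), which yields VaR_alpha(C) <= mu_C + sigma_C * sqrt(alpha / (1 - alpha)); constraining this moment-based bound is sufficient for the VaR constraint. Below is a minimal sketch of the bound (the sampling-based estimator and the alpha convention are assumptions, not the paper's implementation):

```python
import numpy as np

def chebyshev_var_surrogate(cost_returns: np.ndarray, alpha: float) -> float:
    """Upper-bound VaR_alpha of the cost return via Cantelli's inequality.

    Cantelli: P(C - mu >= t) <= sigma^2 / (sigma^2 + t^2).
    Solving sigma^2 / (sigma^2 + t^2) = 1 - alpha for t gives
    t = sigma * sqrt(alpha / (1 - alpha)), hence
    VaR_alpha(C) <= mu + sigma * sqrt(alpha / (1 - alpha)).
    """
    mu = cost_returns.mean()
    sigma = cost_returns.std(ddof=1)   # sample estimates of the first two moments
    return mu + sigma * np.sqrt(alpha / (1.0 - alpha))

# The surrogate constraint replaces VaR_alpha(C) <= d with
# mu + sigma * sqrt(alpha / (1 - alpha)) <= d, which is smooth in the moments.
rng = np.random.default_rng(0)
costs = rng.exponential(scale=1.0, size=10_000)   # hypothetical cost returns
print(chebyshev_var_surrogate(costs, alpha=0.95), np.quantile(costs, 0.95))
```

As the printed comparison shows, the Cantelli bound is conservative: it sits well above the empirical 95% quantile, which is the price paid for a tractable, distribution-free surrogate.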
Related papers
- Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality [53.525547349715595]
We propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO). RRPO operates directly on the primal problem without relying on dual formulations. We show convergence to an approximately optimal feasible policy with complexity matching the best-known lower bound.
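The summary states the idea only at a high level; as an illustration of what a primal-only rectified objective can look like (the ReLU rectifier and fixed penalty weight below are assumptions, not RRPO's exact construction):

```python
import numpy as np

def rectified_primal_objective(reward_value: float,
                               cost_values: np.ndarray,
                               budgets: np.ndarray,
                               beta: float = 10.0) -> float:
    """Primal-only surrogate: reward minus a rectified (ReLU) penalty on each
    constraint's violation. No dual variables are maintained, and the
    max(0, .) rectifier leaves feasible policies entirely unpenalized."""
    violations = np.maximum(0.0, cost_values - budgets)
    return reward_value - beta * violations.sum()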
arXiv Detail & Related papers (2025-08-24T16:59:38Z)
- Proactive Constrained Policy Optimization with Preemptive Penalty [11.93135424276656]
We propose a novel preemptive penalty mechanism for constrained policy optimization. This mechanism integrates barrier terms into the objective function as the policy nears the constraint boundary, imposing a cost before any violation occurs. We also introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary.
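A minimal sketch of a preemptive barrier of the kind described: inactive away from the boundary, and growing without bound as the expected cost approaches the budget. The activation margin and the log-barrier form are illustrative assumptions:

```python
import math

def preemptive_barrier(expected_cost: float, budget: float,
                       margin: float = 0.1) -> float:
    """Zero while the policy is safely inside the feasible region; a
    log-barrier cost once expected_cost enters the margin band just below
    the budget, diverging as the boundary itself is reached."""
    threshold = (1.0 - margin) * budget
    if expected_cost <= threshold:
        return 0.0                                 # far from boundary: no penalty
    slack = max(budget - expected_cost, 1e-8)      # clipped for numerical safety
    return -math.log(slack / (margin * budget))    # 0 at threshold, +inf at budget
```

The barrier is continuous at the activation threshold (the log argument equals 1 there), so turning it on does not introduce a jump in the objective.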
arXiv Detail & Related papers (2025-08-03T18:35:55Z)
- SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL [54.022106606140774]
We present theoretical results that place a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setting. This bound can be applied to temporally extended properties (beyond safety) and to robust control problems. We present experimental results demonstrating the trade-off between the strength of the safety guarantee and task performance, and comparing the theoretical bound to posterior bounds derived from empirical violation rates.
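One generic way such a bound can arise (a sketch consistent with the summary, not necessarily SPoRt's exact statement): if every per-step action-probability ratio between the new policy and a prior safe policy is at most rho, a change of measure over length-H trajectories bounds the new violation probability by rho^H times the prior's, which also exhibits the trade-off between deviating from the prior and retaining its guarantee:

```python
def policy_ratio_violation_bound(prior_violation_prob: float,
                                 max_step_ratio: float,
                                 horizon: int) -> float:
    """Change-of-measure bound: p_new(traj) <= rho^H * p_prior(traj) for every
    trajectory, so summing over unsafe trajectories gives
    P_new(unsafe) <= rho^H * P_prior(unsafe), capped at 1."""
    return min(1.0, prior_violation_prob * max_step_ratio ** horizon)

# Hypothetical usage: prior violates with prob 1e-4, ratio bound 1.05, horizon 50.
print(policy_ratio_violation_bound(1e-4, 1.05, 50))   # ~1.15e-3
```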
arXiv Detail & Related papers (2025-04-08T19:09:07Z)
- Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space to ensure trust regions contain only safe policies. Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
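A hedged sketch of the reshaping idea: augment the trust-region divergence with a barrier on the cost slack, so that steps toward the constraint boundary register as large divergence and the trust region cannot cross it. The log-barrier and its weight below are assumptions, not C-TRPO's exact divergence:

```python
import math

def constrained_divergence(kl: float, expected_cost_new: float,
                           expected_cost_old: float, budget: float,
                           beta: float = 1.0) -> float:
    """Trust-region 'distance' that blows up at the constraint boundary:
    the standard KL term plus the increase of a log-barrier on the cost
    slack, so any step that shrinks the slack counts as extra divergence."""
    phi_new = -math.log(max(budget - expected_cost_new, 1e-8))
    phi_old = -math.log(max(budget - expected_cost_old, 1e-8))
    return kl + beta * max(0.0, phi_new - phi_old)
```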
arXiv Detail & Related papers (2024-11-05T09:55:50Z)
- Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints [52.37099916582462]
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints.
We propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN).
PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration.
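As a rough illustration (the PMN architecture and its training are not described here, so the fixed monotone curve below stands in for the learned network): an exterior penalty leaves feasible policies untouched and scales the penalty on infeasible ones by a violation-dependent factor.

```python
import numpy as np

def penalty_metric(violation: float) -> float:
    """Stand-in for the Penalty Metric Network: a monotone map from the size
    of the violation to a penalty coefficient (a learned network in EPO; a
    fixed log-shaped curve here, purely for illustration)."""
    return 1.0 + np.log1p(10.0 * violation)

def exterior_penalty_objective(reward_value: float, expected_cost: float,
                               budget: float) -> float:
    """Exterior penalty: zero inside the feasible set, an adaptively scaled
    cost for constraint-violating policies."""
    violation = max(0.0, expected_cost - budget)
    return reward_value - penalty_metric(violation) * violation
```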
arXiv Detail & Related papers (2024-07-22T10:57:32Z)
- A Convex Framework for Confounding Robust Inference [21.918894096307294]
We study policy evaluation of offline contextual bandits subject to unobserved confounders.
We propose a general estimator that provides a sharp lower bound of the policy value using convex programming.
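A toy instance of the kind of optimization involved (a sketch, not the paper's estimator): with box-constrained importance weights w_i in [a_i, b_i] encoding the confounding uncertainty, the worst-case weighted mean of outcomes is a fractional linear program whose optimum sets every weight at a box endpoint, so scanning split points in outcome order solves it exactly.

```python
import numpy as np

def worst_case_mean(y: np.ndarray, a: np.ndarray, b: np.ndarray) -> float:
    """Sharp lower bound of sum(w*y)/sum(w) over box weights a <= w <= b.
    The minimizer gives maximal weight b to small outcomes and minimal
    weight a to large ones, split at a threshold; scan all n+1 splits."""
    order = np.argsort(y)
    y, a, b = y[order], a[order], b[order]
    best = np.inf
    for k in range(len(y) + 1):                 # k = how many get weight b
        w = np.concatenate([b[:k], a[k:]])
        best = min(best, float(w @ y / w.sum()))
    return best

# Hypothetical usage: inverse-propensity weights w0 under a marginal
# sensitivity model with odds parameter Gamma => box [w0/Gamma, w0*Gamma].
rng = np.random.default_rng(1)
y, w0, Gamma = rng.normal(size=200), rng.uniform(1, 5, 200), 1.5
print(worst_case_mean(y, w0 / Gamma, w0 * Gamma))
```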
arXiv Detail & Related papers (2023-09-21T19:45:37Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
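COptiDICE works in the space of occupancy measures. The exact object underneath (for a known tabular MDP) is the linear program below, which the algorithm approximates from offline data by estimating the corrections w(s,a) = d_pi(s,a)/d_D(s,a) relative to the dataset distribution; the tiny MDP and SciPy formulation are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny 2-state, 2-action MDP (all quantities hypothetical).
S, A, gamma, budget = 2, 2, 0.9, 0.15
P = np.zeros((S, A, S)); P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 1] = 1.0
r = np.array([[0.0, 1.0], [0.2, 0.5]])   # reward(s, a)
c = np.array([[0.0, 1.0], [0.0, 0.0]])   # cost(s, a)
mu0 = np.array([1.0, 0.0])               # initial-state distribution

# Variables: occupancy d(s, a), flattened. Maximize sum(d*r) => minimize -r.
n = S * A
A_eq = np.zeros((S, n)); b_eq = (1 - gamma) * mu0   # Bellman flow constraints
for s in range(S):
    for s2 in range(S):
        for a in range(A):
            A_eq[s, s2 * A + a] = (s == s2) - gamma * P[s2, a, s]
A_ub = c.reshape(1, n); b_ub = np.array([budget])   # cost-budget constraint
res = linprog(-r.reshape(n), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
d = res.x.reshape(S, A)
print("occupancy:", d, "return:", (d * r).sum(), "cost:", (d * c).sum())
```

The optimal policy is recovered by normalizing: pi(a|s) proportional to d(s, a). The offline setting replaces the known P and exhaustive state enumeration with sample-based estimation of the corrections w.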
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
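CRPO's mechanism is a primal alternation: improve the reward while all constraints hold within a tolerance, otherwise take a step that decreases a violated constraint's cost. A minimal sketch (the gradient oracles and step size below are assumptions):

```python
def crpo_step(theta, reward_grad, cost_values, cost_grads, budgets,
              eta=0.1, tol=0.05):
    """One CRPO-style primal step: ascend the reward when every constraint
    holds within tolerance, otherwise descend a violated constraint's cost."""
    violated = [i for i, (jc, d) in enumerate(zip(cost_values, budgets))
                if jc > d + tol]
    if not violated:
        return theta + eta * reward_grad(theta)    # feasible: improve reward
    i = violated[0]                                # pick a violated constraint
    return theta - eta * cost_grads[i](theta)      # infeasible: reduce its cost

# Toy usage with a scalar parameter (all oracles hypothetical):
theta = crpo_step(0.0, reward_grad=lambda t: 1.0,
                  cost_values=[0.3], cost_grads=[lambda t: 2.0],
                  budgets=[0.1])
print(theta)   # -0.2: the violated constraint is reduced instead of the reward
```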
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.