Quantile Constrained Reinforcement Learning: A Reinforcement Learning Framework Constraining Outage Probability
- URL: http://arxiv.org/abs/2211.15034v1
- Date: Mon, 28 Nov 2022 03:46:56 GMT
- Title: Quantile Constrained Reinforcement Learning: A Reinforcement Learning Framework Constraining Outage Probability
- Authors: Whiyoung Jung, Myungsik Cho, Jongeui Park, Youngchul Sung
- Abstract summary: Constrained reinforcement learning (RL) aims to find an optimal policy that maximizes expected cumulative return while satisfying a given constraint.
This paper proposes a framework, named Quantile Constrained RL (QCRL), to constrain the quantile of the distribution of the cumulative sum cost.
- Score: 16.861004263551447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constrained reinforcement learning (RL) is an area of RL whose objective is
to find an optimal policy that maximizes expected cumulative return while
satisfying a given constraint. Most of the previous constrained RL works
consider the expected cumulative sum cost as the constraint. However, optimization
with this constraint cannot guarantee a target probability for the outage event,
i.e., the event that the cumulative sum cost exceeds a given threshold. This paper
proposes a framework, named Quantile Constrained RL (QCRL), that constrains the
quantile of the distribution of the cumulative sum cost, which is a necessary and
sufficient condition for satisfying the outage constraint. This is the first work that tackles
the issue of applying the policy gradient theorem to the quantile and provides
theoretical results for approximating the gradient of the quantile. Based on
the derived theoretical results and the technique of the Lagrange multiplier,
we construct a constrained RL algorithm named Quantile Constrained Policy
Optimization (QCPO). We use distributional RL with the Large Deviation
Principle (LDP) to estimate quantiles and tail probability of the cumulative
sum cost for the implementation of QCPO. The implemented algorithm satisfies
the outage probability constraint after the training period.
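As a rough illustration of the Lagrangian approach described above, the sketch below performs one primal-dual update in which the (1 - epsilon)-quantile of sampled episode costs is estimated empirically and the multiplier is raised while that estimate exceeds the threshold. It is a simplification, not the paper's algorithm: the function name quantile_constrained_update, the episode fields, the indicator-based cost surrogate, and all hyperparameters are illustrative assumptions, whereas QCPO uses its derived quantile-gradient approximation together with distributional RL and LDP-based tail estimates.

```python
import torch

def quantile_constrained_update(optimizer, episodes, lam, *,
                                quantile_level=0.95,   # 1 - epsilon (target outage level), assumed
                                cost_threshold=10.0,   # threshold d, assumed
                                lam_lr=1e-2):
    """One primal-dual update for a quantile-constrained objective (illustrative sketch).

    Each element of `episodes` is a dict with:
      "log_prob" : sum of log pi(a_t | s_t) over the episode (0-d tensor, requires grad)
      "return"   : cumulative reward of the episode (float)
      "cost"     : cumulative sum cost of the episode (float)
    """
    returns = torch.tensor([e["return"] for e in episodes])
    costs = torch.tensor([e["cost"] for e in episodes])
    log_probs = torch.stack([e["log_prob"] for e in episodes])

    # Empirical (1 - epsilon)-quantile of the cumulative-cost distribution.
    q_hat = torch.quantile(costs, quantile_level).item()

    # REINFORCE-style primal step: maximize return, penalize outage episodes.
    # This indicator surrogate stands in for the paper's quantile-gradient approximation.
    adv_reward = returns - returns.mean()
    outage = (costs > cost_threshold).float()      # 1{C > d}
    adv_cost = outage - outage.mean()
    loss = -(log_probs * (adv_reward - lam * adv_cost)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Dual ascent: raise lam while the estimated quantile violates the threshold.
    lam = max(0.0, lam + lam_lr * (q_hat - cost_threshold))
    return lam, q_hat
```

The dual step mirrors the constraint: lam grows while the estimated quantile is above the threshold d and decays toward zero once the constraint holds, which is what drives the outage probability P(C > d) toward the target level by the end of training.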
Related papers
- Probabilistic Satisfaction of Temporal Logic Constraints in Reinforcement Learning via Adaptive Policy-Switching [0.0]
Constrained Reinforcement Learning (CRL) is a subset of machine learning that introduces constraints into the traditional reinforcement learning (RL) framework.
We propose a novel framework that relies on switching between pure learning (reward) and constraint satisfaction.
arXiv Detail & Related papers (2024-10-10T15:19:45Z)
- A Primal-Dual-Critic Algorithm for Offline Constrained Reinforcement Learning [35.13329809630571]
We propose Primal-Dual-Critic Algorithm (PDCA), a novel algorithm for offline constrained RL with general function approximation.
PDCA runs a primal-dual algorithm on the Lagrangian function estimated by critics.
We show that PDCA can successfully find a near saddle point of the Lagrangian, which is nearly optimal for the constrained RL problem.
arXiv Detail & Related papers (2023-06-13T14:50:03Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks; a minimal sketch of this penalized objective appears after this list.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality [141.89413461337324]
Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL).
We propose a theoretical formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective.
arXiv Detail & Related papers (2022-02-14T01:31:46Z)
- Quantile-Based Policy Optimization for Reinforcement Learning [0.0]
We parameterize the policy controlling actions by neural networks and propose a novel policy gradient algorithm called Quantile-Based Policy Optimization (QPO).
Our numerical results demonstrate that the proposed algorithms outperform the existing baseline algorithms under the quantile criterion.
arXiv Detail & Related papers (2022-01-27T12:01:36Z)
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes the policy for violating the SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
arXiv Detail & Related papers (2021-07-13T21:39:21Z)
- A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback [22.17503016665027]
We consider problems where each action returns a random reward, cost, and penalty from an unknown joint distribution.
We propose a novel low-complexity algorithm, named $\tt LyOn$, and prove that it achieves $O(\sqrt{B}\log B)$ regret and $O(\log B/B)$ constraint-violation.
The low computational cost bounds of $\tt LyOn$ suggest that the Lyapunov-based algorithm design methodology can be effective in solving constrained bandit optimization problems.
arXiv Detail & Related papers (2021-06-09T16:12:07Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
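The Penalized Proximal Policy Optimization (P3O) entry above combines a clipped surrogate with an exact-penalty term, so the constrained update becomes a single unconstrained minimization. The sketch below is one plausible reading of that objective rather than the paper's exact loss: the function name p3o_style_loss, the argument names, and the default penalty coefficient kappa are assumptions made for illustration.

```python
import torch

def p3o_style_loss(ratio, adv_reward, adv_cost, cost_estimate, cost_limit,
                   clip_eps=0.2, kappa=20.0):
    """Penalized, clipped surrogate loss in the spirit of P3O (illustrative sketch).

    ratio        : pi_theta(a|s) / pi_old(a|s) for a batch of state-action pairs
    adv_reward   : reward advantages under pi_old
    adv_cost     : cost advantages under pi_old
    cost_estimate: estimated expected cumulative cost of pi_old (scalar)
    cost_limit   : constraint threshold d
    kappa        : exact-penalty coefficient (assumed value)
    """
    # Standard PPO clipped surrogate for the reward objective (to be maximized).
    surr_reward = torch.min(ratio * adv_reward,
                            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_reward).mean()

    # Clipped surrogate for the cost; clipping with max is pessimistic,
    # i.e., it over-estimates the cost increase of the new policy.
    surr_cost = torch.max(ratio * adv_cost,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_cost).mean()

    # Exact-penalty term: active only when the estimated constraint is violated.
    violation = torch.relu(surr_cost + cost_estimate - cost_limit)

    # A single unconstrained minimization replaces the constrained policy iteration.
    return -surr_reward + kappa * violation
```

Because the penalty passes through a ReLU, it vanishes when the estimated cost satisfies the limit, so the update reduces to the ordinary clipped PPO step in the feasible region.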
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.