Learn Zero-Constraint-Violation Policy in Model-Free Constrained
Reinforcement Learning
- URL: http://arxiv.org/abs/2111.12953v1
- Date: Thu, 25 Nov 2021 07:24:30 GMT
- Title: Learn Zero-Constraint-Violation Policy in Model-Free Constrained
Reinforcement Learning
- Authors: Haitong Ma, Changliu Liu, Shengbo Eben Li, Sifa Zheng, Wenchao Sun,
Jianyu Chen
- Abstract summary: We propose the safe set actor-critic (SSAC) algorithm, which confines the policy update using safety-oriented energy functions.
The safety index is designed to increase rapidly for potentially dangerous actions.
We claim that we can learn the energy function in a model-free manner similar to learning a value function.
- Score: 7.138691584246846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the trial-and-error mechanism of reinforcement learning (RL), a
notorious contradiction arises when we expect to learn a safe policy: how can a
safe policy be learned without enough data or a prior model of the dangerous
region? Existing methods mostly apply a posterior penalty to dangerous actions,
meaning the agent is not penalized until it has actually experienced danger. As
a result, the agent cannot learn a zero-violation policy even after
convergence; otherwise, it would no longer receive any penalty and would lose
its knowledge of the danger. In this paper, we propose the safe set
actor-critic (SSAC) algorithm, which confines the policy update using
safety-oriented energy functions, or safety indexes. The safety index is
designed to increase rapidly for potentially dangerous actions, which allows us
to locate the safe set in the action space, or the control safe set. Therefore,
we can identify dangerous actions before taking them and obtain a
zero-constraint-violation policy after convergence. We claim that the energy
function can be learned in a model-free manner, similar to learning a value
function. By using the energy function transition as the constraint objective,
we formulate a constrained RL problem. We prove that our Lagrangian-based
solution ensures that the learned policy converges to the constrained optimum
under some assumptions. The proposed algorithm is evaluated on both complex
simulation environments and a hardware-in-the-loop (HIL) experiment with a real
controller from an autonomous vehicle. Experimental results suggest that the
converged policy in all environments achieves zero constraint violations and
performance comparable to model-based baselines.
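The abstract describes the key mechanism: a learned safety-index prediction enters the policy update as a constraint with a Lagrange multiplier. Below is a minimal PyTorch-style sketch of that coupling, assuming a deterministic actor and a critic `q_safety` that predicts the change of the safety index caused by an action; all names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative networks only; the actual SSAC actor-critic machinery differs.
obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_reward = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# Safety critic: predicts the change of the safety index phi caused by an action,
# learned model-free from observed phi(s') - phi(s) transitions, like a value function.
q_safety = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

log_lambda = torch.zeros(1, requires_grad=True)            # Lagrange multiplier (log-space)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-3)

def lagrangian_step(obs):
    act = actor(obs)
    x = torch.cat([obs, act], dim=-1)
    lam = log_lambda.exp()
    # Constraint: the predicted safety-index increase should be non-positive.
    constraint = torch.relu(q_safety(x)).mean()
    actor_loss = -q_reward(x).mean() + lam.detach() * constraint
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Dual ascent: grow lambda while the constraint is still violated.
    lambda_loss = -(lam * constraint.detach())
    lambda_opt.zero_grad(); lambda_loss.backward(); lambda_opt.step()

lagrangian_step(torch.randn(32, obs_dim))
```

Critic training for `q_reward` and `q_safety` is omitted; the sketch only shows how a learned safety-index prediction can enter the actor loss as a constraint with a learned multiplier.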
Related papers
- Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning [5.862025534776996]
Reinforcement Learning for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment.
In such methods, if agents are in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized.
We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy.
In a philosophical sense this formulation only penalizes the learner for constraint violations that it caused; in a practical sense it maintains feasibility of the optimal control problem.
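Stated as a constrained optimization problem, the formulation above could be written roughly as (our paraphrase; the symbols $J$, $H$, $\pi_0$, and the budget $\delta$ are not taken from the paper)

$$\max_{\pi} \; J(\pi) \quad \text{s.t.} \quad \mathbb{E}_{\pi}\big[ H(\pi \,;\, \pi_0) \big] \le \delta,$$

where $\pi_0$ is the default safe policy and $H$ measures the counterfactual harm of following $\pi$ instead of $\pi_0$.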
arXiv Detail & Related papers (2024-05-19T20:33:21Z)
- Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued, and interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
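As a hedged illustration of "a Gaussian mixture distribution as the policy output", here is a minimal PyTorch policy head; the class name, layer sizes, and component count are assumptions, and MUPO's actual architecture and training objective differ.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class MixturePolicy(nn.Module):
    """Hypothetical Gaussian-mixture policy head (illustrative, not MUPO's code)."""
    def __init__(self, obs_dim, act_dim, n_components=2, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, n_components)               # mixture weights
        self.means = nn.Linear(hidden, n_components * act_dim)      # component means
        self.log_stds = nn.Linear(hidden, n_components * act_dim)   # component log-stds
        self.n, self.act_dim = n_components, act_dim

    def forward(self, obs):
        h = self.backbone(obs)
        mix = Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.n, self.act_dim)
        stds = self.log_stds(h).view(-1, self.n, self.act_dim).clamp(-5, 2).exp()
        comp = Independent(Normal(means, stds), 1)
        return MixtureSameFamily(mix, comp)      # multimodal action distribution

policy = MixturePolicy(obs_dim=8, act_dim=2)
dist = policy(torch.randn(4, 8))
actions = dist.sample()                          # shape (4, 2)
log_prob = dist.log_prob(actions)                # usable in a policy-gradient loss
```

A mixture head can place probability mass on disjoint safe modes instead of interpolating between them, which is the failure mode the paper identifies.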
arXiv Detail & Related papers (2024-03-19T15:54:38Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
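A minimal sketch of the multiplicative combination described above, assuming PyTorch critics; the names, architectures, and the exact way the two critics are combined and trained are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
reward_critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
safety_critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())   # P(constraint violation)

def multiplicative_value(obs, act):
    x = torch.cat([obs, act], dim=-1)
    p_violation = safety_critic(x)      # probability of constraint violation, in [0, 1]
    q_reward = reward_critic(x)         # constraint-free return estimate
    # The safety critic discounts the reward critic: likely-unsafe actions are worth less.
    return (1.0 - p_violation) * q_reward

q = multiplicative_value(torch.randn(32, obs_dim), torch.randn(32, act_dim))
```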
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
- Safe Deep Reinforcement Learning by Verifying Task-Level Properties [84.64203221849648]
Cost functions are commonly employed in Safe Deep Reinforcement Learning (DRL).
The cost is typically encoded as an indicator function due to the difficulty of quantifying the risk of policy decisions in the state space.
In this paper, we investigate an alternative approach that uses domain knowledge to quantify the risk in the proximity of such states by defining a violation metric.
arXiv Detail & Related papers (2023-02-20T15:24:06Z)
- Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z)
- Enhancing Safe Exploration Using Safety State Augmentation [71.00929878212382]
We tackle the problem of safe exploration in model-free reinforcement learning.
We derive policies for scheduling the safety budget during training.
We show that Simmer can stabilize training and improve the performance of safe RL with average constraints.
arXiv Detail & Related papers (2022-06-06T15:23:07Z)
- SAUTE RL: Almost Surely Safe Reinforcement Learning Using State Augmentation [63.25418599322092]
Satisfying safety constraints almost surely (or with probability one) can be critical for deployment of Reinforcement Learning (RL) in real-life applications.
We address the problem by introducing Safety Augmented Markov Decision Processes (MDPs).
We show that the Saute MDP allows viewing the safety-augmentation problem from a different perspective, enabling new features.
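A rough sketch of the state-augmentation idea (extend the observation with the remaining safety budget and make exhausting it catastrophic), written as plain helper functions; the normalization, penalty value, and names are assumptions, not the authors' implementation.

```python
import numpy as np

def augment(obs, remaining_budget, total_budget):
    # Append the normalized remaining safety budget as an extra observation feature.
    return np.append(obs, remaining_budget / total_budget)

def augmented_step(obs, reward, cost, remaining_budget, total_budget, penalty=-100.0):
    remaining_budget -= cost
    if remaining_budget < 0.0:
        reward = penalty            # budget exhausted: override the reward
    return augment(obs, max(remaining_budget, 0.0), total_budget), reward, remaining_budget
```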
arXiv Detail & Related papers (2022-02-14T08:57:01Z)
- Model-based Chance-Constrained Reinforcement Learning via Separated Proportional-Integral Lagrangian [5.686699342802045]
We propose a separated proportional-integral Lagrangian algorithm to enhance RL safety under uncertainty.
We demonstrate that our method can reduce the oscillations and conservatism of the RL policy in a car-following simulation.
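For intuition, here is a generic proportional-integral multiplier update in the spirit described; the gains are arbitrary, and the "separated" treatment that gives the algorithm its name is not reproduced.

```python
def pi_multiplier(violation, integral, kp=0.1, ki=0.01):
    # The integral part accumulates violation like a plain Lagrange multiplier;
    # the proportional part reacts to the current violation and damps oscillations.
    integral = max(0.0, integral + ki * violation)
    lam = max(0.0, kp * violation + integral)
    return lam, integral

lam, integral = 0.0, 0.0
for violation in [0.5, 0.3, 0.1, -0.2, -0.1]:     # example constraint-violation signal
    lam, integral = pi_multiplier(violation, integral)
```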
arXiv Detail & Related papers (2021-08-26T07:34:14Z)
- Safe Reinforcement Learning Using Advantage-Based Intervention [45.79740561754542]
Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints.
We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training.
Our method comes with strong guarantees on safety during both training and deployment.
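A hypothetical sketch of an advantage-based intervention gate consistent with the summary above; the criterion, threshold, and backup policy are illustrative, and SAILR's actual rule and guarantees are given in the paper.

```python
def safe_action(obs, proposed_action, advantage_fn, backup_policy, threshold=0.0):
    # advantage_fn scores how much the proposed action worsens safety relative to
    # the intervention (backup) policy; intervene when it exceeds the threshold.
    if advantage_fn(obs, proposed_action) > threshold:
        return backup_policy(obs), True       # intervened with the backup action
    return proposed_action, False             # original action kept
```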
arXiv Detail & Related papers (2021-06-16T20:28:56Z)