Efficient Off-Policy Safe Reinforcement Learning Using Trust Region
Conditional Value at Risk
- URL: http://arxiv.org/abs/2312.00342v1
- Date: Fri, 1 Dec 2023 04:29:19 GMT
- Title: Efficient Off-Policy Safe Reinforcement Learning Using Trust Region
Conditional Value at Risk
- Authors: Dohyeong Kim and Songhwai Oh
- Abstract summary: An on-policy safe RL method, called TRC, deals with a CVaR-constrained RL problem using a trust region method.
To achieve outstanding performance in complex environments and satisfy safety constraints quickly, RL methods are required to be sample efficient.
We propose novel surrogate functions that reduce the effect of the distributional shift and introduce an adaptive trust-region constraint that keeps the policy from deviating far from the replay buffer.
- Score: 16.176812250762666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to solve a safe reinforcement learning (RL) problem with risk
measure-based constraints. As risk measures, such as conditional value at risk
(CVaR), focus on the tail distribution of cost signals, constraining risk
measures can effectively prevent a failure in the worst case. An on-policy safe
RL method, called TRC, deals with a CVaR-constrained RL problem using a trust
region method and can generate policies with almost zero constraint violations
and high returns. However, to achieve outstanding performance in complex
environments and satisfy safety constraints quickly, RL methods are required to
be sample efficient. To this end, we propose an off-policy safe RL method with
CVaR constraints, called off-policy TRC. If off-policy data from replay buffers
is directly used to train TRC, the estimation error caused by the
distributional shift results in performance degradation. To resolve this issue,
we propose novel surrogate functions that reduce the effect of the
distributional shift and introduce an adaptive trust-region constraint that
keeps the policy from deviating far from the replay buffer. The proposed method has
been evaluated in simulation and real-world environments and satisfied safety
constraints within a few steps while achieving high returns even in complex
robotic tasks.
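For background, the CVaR constraint referenced above can be written in the standard Rockafellar-Uryasev form together with the constrained RL objective (a generic sketch; the notation is illustrative and not taken verbatim from the paper):

    \mathrm{CVaR}_\alpha(C) = \min_{\rho \in \mathbb{R}} \Big\{ \rho + \tfrac{1}{\alpha}\,\mathbb{E}\big[(C-\rho)_+\big] \Big\},
    \qquad
    \max_{\pi} \; \mathbb{E}_{\pi}\Big[\textstyle\sum_t \gamma^t r_t\Big]
    \;\; \text{s.t.} \;\;
    \mathrm{CVaR}_\alpha\Big(\textstyle\sum_t \gamma^t c_t\Big) \le d,

where r_t and c_t are the reward and cost signals, \alpha is the risk level, and d is the cost threshold. Constraining the \alpha-tail of the cost distribution, rather than its mean, is what targets worst-case failures.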
Related papers
- Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors.
We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
arXiv Detail & Related papers (2024-11-05T09:55:50Z)
- TRC: Trust Region Conditional Value at Risk for Safe Reinforcement Learning [16.176812250762666]
We propose a trust region-based safe RL method with CVaR constraints, called TRC.
We first derive the upper bound on CVaR and then approximate the upper bound in a differentiable form in a trust region.
Compared to other safe RL methods, TRC improves performance by 1.93 times while satisfying the constraints in all experiments.
arXiv Detail & Related papers (2023-12-01T04:40:47Z)
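As a rough illustration of the quantity that such a CVaR constraint bounds, the tail average of sampled episode costs can be estimated as below (a generic estimator, not TRC's surrogate; the function name and risk level are assumptions):

    import numpy as np

    def empirical_cvar(episode_costs, alpha=0.1):
        """Mean of the worst alpha-fraction of sampled episode cost sums,
        a simple sample estimate of CVaR_alpha of the cost distribution."""
        costs = np.sort(np.asarray(episode_costs, dtype=float))  # ascending order
        k = max(1, int(np.ceil(alpha * len(costs))))              # size of the upper tail
        return float(costs[-k:].mean())                           # average tail cost

For example, empirical_cvar([1.0, 0.0, 5.0, 2.0], alpha=0.25) returns 5.0, the cost of the single worst episode.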
- Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback [57.6775169085215]
Risk-sensitive reinforcement learning aims to optimize policies that balance the expected reward and risk.
We present a novel framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations.
We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis.
arXiv Detail & Related papers (2023-07-06T08:14:54Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
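A minimal sketch of such a multiplicative composition, assuming a safety critic that outputs a violation probability and a reward critic that estimates constraint-free return (illustrative names only, not the paper's implementation):

    def multiplicative_value(reward_value: float, p_violation: float) -> float:
        """Discount the constraint-free return estimate by the probability of
        remaining safe, so likely-unsafe states receive little value."""
        p_safe = 1.0 - p_violation     # probability of no constraint violation
        return p_safe * reward_value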
- Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z)
- Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate [6.581362609037603]
We build a safe reinforcement learning framework to resolve constraints required by the DRC and its corresponding shield policy.
We also devise a line search method to maintain safety and reach higher returns simultaneously while leveraging the shield policy.
arXiv Detail & Related papers (2022-10-14T06:16:53Z)
- Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safety rate measured in simulations.
arXiv Detail & Related papers (2022-09-29T20:49:25Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
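For context, the stationary-distribution view of a constrained MDP that DICE-style methods build on can be sketched as the following linear program (standard background, not COptiDICE's exact objective):

    \max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
    \;\; \text{s.t.} \;\;
    \sum_{s,a} d(s,a)\, c(s,a) \le C_{\max},
    \quad
    \sum_{a'} d(s',a') = (1-\gamma)\, \mu_0(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, d(s,a) \;\; \forall s',

where d(s,a) is the discounted stationary state-action distribution, \mu_0 is the initial-state distribution, and C_{\max} is the cost budget. In the offline setting, the ratio d(s,a)/d_D(s,a) with respect to the dataset distribution d_D is the correction that is estimated from data.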
- Lyapunov Barrier Policy Optimization [15.364174084072872]
We propose a new method, LBPO, that uses a Lyapunov-based barrier function to restrict the policy update to a safe set for each training iteration.
Our method also allows the user to control the conservativeness of the agent with respect to the constraints in the environment.
arXiv Detail & Related papers (2021-03-16T17:58:27Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces divergence from the behavior data and a value constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
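A schematic version of such a doubly penalized actor objective, with hypothetical names and fixed coefficients standing in for whatever weighting the method actually uses:

    def penalized_actor_objective(expected_q: float,
                                  kl_to_behavior: float,
                                  optimism_gap: float,
                                  lam_policy: float = 1.0,
                                  lam_value: float = 1.0) -> float:
        """Maximize the estimated return while penalizing (i) divergence of the
        learned policy from the behavior data and (ii) value estimates that
        exceed a conservative reference (overestimation)."""
        return expected_q - lam_policy * kl_to_behavior - lam_value * optimism_gap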