Towards Safe Reinforcement Learning via Constraining Conditional
Value-at-Risk
- URL: http://arxiv.org/abs/2206.04436v1
- Date: Thu, 9 Jun 2022 11:57:54 GMT
- Title: Towards Safe Reinforcement Learning via Constraining Conditional
Value-at-Risk
- Authors: Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu
- Abstract summary: We propose a novel reinforcement learning algorithm, CVaR-Proximal-Policy-Optimization (CPPO), which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold.
Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances.
- Score: 30.229387511344456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though deep reinforcement learning (DRL) has achieved substantial
success, it may encounter catastrophic failures due to the intrinsic
uncertainty of both transitions and observations. Most existing methods for
safe reinforcement learning can handle only transition disturbance or
observation disturbance, since these two kinds of disturbance affect different
parts of the agent; moreover, the popular worst-case return may lead to overly
pessimistic policies.
To address these issues, we first theoretically prove that the performance
degradation under transition disturbance and observation disturbance depends on
a novel metric, the Value Function Range (VFR), which corresponds to the gap in
the value function between the best state and the worst state. Based on the
analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk
and propose a novel reinforcement learning algorithm,
CVaR-Proximal-Policy-Optimization (CPPO), which formalizes the risk-sensitive
constrained optimization problem by keeping its CVaR under a given threshold.
Experimental results show that CPPO achieves a higher cumulative reward and is
more robust against both observation and transition disturbances on a series of
continuous control tasks in MuJoCo.
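As a rough illustration of the quantities named above, the Python sketch below estimates an empirical CVaR over sampled episode costs, computes the VFR gap between the best and worst state values, and applies a generic Lagrangian-style penalty for keeping the cost CVaR under a threshold. The function names, the toy cost distribution, and the penalty form are assumptions made for illustration only, not CPPO's actual algorithm.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.1):
    """Empirical CVaR_alpha of a cost: mean of the worst (largest) alpha-fraction of samples."""
    costs = np.sort(np.asarray(costs))[::-1]        # descending: largest costs first
    k = max(1, int(np.ceil(alpha * len(costs))))    # size of the alpha-tail
    return float(costs[:k].mean())

def value_function_range(values):
    """VFR as described in the abstract: gap between the best and the worst state value."""
    values = np.asarray(values)
    return float(values.max() - values.min())

def penalized_objective(expected_return, cvar_cost, threshold, lam):
    """Illustrative Lagrangian-style surrogate: maximize return while penalizing
    any excess of the cost CVaR over the allowed threshold."""
    return expected_return - lam * max(0.0, cvar_cost - threshold)

# Toy usage with hypothetical per-episode costs (e.g., accumulated safety violations).
rng = np.random.default_rng(0)
episode_costs = rng.gamma(shape=2.0, scale=5.0, size=1000)
cvar_cost = empirical_cvar(episode_costs, alpha=0.1)
print("CVaR_0.1 of cost:", cvar_cost)
print("VFR of toy state values:", value_function_range(rng.normal(size=10)))
print("Penalized objective:", penalized_objective(100.0, cvar_cost, threshold=25.0, lam=10.0))
```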
Related papers
- Adversarial Robustness Overestimation and Instability in TRADES [4.063518154926961]
TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task.
This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking.
arXiv Detail & Related papers (2024-10-10T07:32:40Z)
- The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks [90.52808174102157]
In safety-critical applications such as medical imaging and autonomous driving, it is imperative to maintain both high adversarial robustness against potential adversarial attacks and reliable uncertainty quantification.
A notable knowledge gap remains concerning the uncertainty inherent in adversarially trained models.
This study investigates the uncertainty of deep learning models by examining the performance of conformal prediction (CP) in the context of standard adversarial attacks.
arXiv Detail & Related papers (2024-05-14T18:05:19Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
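As a loose sketch of the kind of recursion an uncertainty Bellman equation describes, the snippet below iterates u(s) = nu(s) + gamma^2 * sum_{s'} P_pi(s'|s) u(s') on a toy tabular chain. The gamma^2 propagation, the policy-induced transition matrix, and the local-uncertainty term follow the generic uncertainty-Bellman form and are assumptions here, not necessarily this paper's exact equation or the QU-SAC implementation.

```python
import numpy as np

def solve_uncertainty_bellman(P_pi, nu, gamma=0.99, iters=2000):
    """Fixed-point iteration for a generic uncertainty Bellman recursion:
        u(s) = nu(s) + gamma^2 * sum_{s'} P_pi(s'|s) * u(s')
    where P_pi is the policy-induced state transition matrix and nu(s) is a
    per-state local uncertainty term (e.g. model-epistemic variance)."""
    u = np.zeros(len(nu))
    for _ in range(iters):
        # gamma is squared because variances, not values, are being propagated
        u = nu + (gamma ** 2) * (P_pi @ u)
    return u

# Toy 3-state chain: most local uncertainty sits in state 2.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
nu = np.array([0.1, 0.1, 1.0])
print(solve_uncertainty_bellman(P_pi, nu, gamma=0.9))
```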
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning [11.084321518414226]
We adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods.
Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
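For context, here is a minimal sketch of the plain per-trajectory importance-sampling estimator whose variance growth with horizon is what distribution-correction (DICE-style) estimators aim to avoid; the helper names and toy numbers are illustrative assumptions, not this paper's method.

```python
import numpy as np

def trajectory_is_ratio(pi_probs, mu_probs):
    """Plain per-trajectory importance-sampling ratio: prod_t pi(a_t|s_t) / mu(a_t|s_t)."""
    return float(np.prod(np.asarray(pi_probs) / np.asarray(mu_probs)))

def is_return_estimate(trajectories):
    """Off-policy estimate of the target policy's return: average of ratio * observed
    return over trajectories collected by the behavior policy mu."""
    weighted = [trajectory_is_ratio(pi, mu) * ret for pi, mu, ret in trajectories]
    return float(np.mean(weighted))

# Toy usage: (target-policy action probs, behavior-policy action probs, observed return).
trajs = [([0.9, 0.8], [0.5, 0.5], 1.0),
         ([0.2, 0.1], [0.5, 0.5], 0.0)]
print(is_return_estimate(trajs))  # per-step ratios compound with horizon, motivating corrections
```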
arXiv Detail & Related papers (2023-07-21T20:54:52Z)
- Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
- Certifying Safety in Reinforcement Learning under Adversarial Perturbation Attacks [23.907977144668838]
We propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption that the true state of the POMDP is known at training time.
We present the first approach for certifying safety of PSRL policies under adversarial input perturbations, and two adversarial training approaches that make direct use of PSRL.
arXiv Detail & Related papers (2022-12-28T22:33:38Z)
- Benchmarking Safe Deep Reinforcement Learning in Aquatic Navigation [78.17108227614928]
We propose a benchmark environment for Safe Reinforcement Learning focusing on aquatic navigation.
We consider both value-based and policy-gradient Deep Reinforcement Learning (DRL) approaches.
We also propose a verification strategy that checks the behavior of the trained models over a set of desired properties.
arXiv Detail & Related papers (2021-12-16T16:53:56Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Learning Robust Feedback Policies from Demonstrations [9.34612743192798]
We propose and analyze a new framework to learn feedback control policies that exhibit provable guarantees on the closed-loop performance and robustness to bounded (adversarial) perturbations.
These policies are learned from expert demonstrations without any prior knowledge of the task, its cost function, and system dynamics.
arXiv Detail & Related papers (2021-03-30T19:11:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.