Reward Certification for Policy Smoothed Reinforcement Learning
- URL: http://arxiv.org/abs/2312.06436v2
- Date: Tue, 12 Dec 2023 12:19:31 GMT
- Title: Reward Certification for Policy Smoothed Reinforcement Learning
- Authors: Ronghui Mu, Leandro Soriano Marcolino, Tianle Zhang, Yanghao Zhang, Xiaowei Huang, Wenjie Ruan
- Abstract summary: Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas.
Recent studies have introduced "smoothed policies" in order to enhance its robustness.
It is still challenging to establish a provable guarantee to certify the bound of its total reward.
- Score: 14.804252729195513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in
safety-critical areas, but it can be weakened by adversarial attacks. Recent
studies have introduced "smoothed policies" in order to enhance its robustness.
Yet, it is still challenging to establish a provable guarantee to certify the
bound of its total reward. Prior methods relied primarily on computing bounds
using Lipschitz continuity or calculating the probability of cumulative reward
above specific thresholds. However, these techniques are only suited for
continuous perturbations on the RL agent's observations and are restricted to
perturbations bounded by the $l_2$-norm. To address these limitations, this
paper proposes a general black-box certification method capable of directly
certifying the cumulative reward of the smoothed policy under various
$l_p$-norm bounded perturbations. Furthermore, we extend our methodology to
certify perturbations on action spaces. Our approach leverages f-divergence to
measure the distinction between the original distribution and the perturbed
distribution, subsequently determining the certification bound by solving a
convex optimisation problem. We provide a comprehensive theoretical analysis
and run sufficient experiments in multiple environments. Our results show that
our method not only improves the certified lower bound of mean cumulative
reward but also demonstrates better efficiency than state-of-the-art
techniques.
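The divergence-based idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes Gaussian smoothing noise, picks the KL divergence as one particular f-divergence, and assumes the cumulative reward is normalised to [0, 1]. Under those assumptions, the data-processing inequality gives kl(q || p) <= KL(Q || P), where p and q are the mean normalised rewards under the original and perturbed noise distributions, so a lower bound on q follows from a one-dimensional convex problem solved here by bisection. The function names (`gaussian_kl`, `bernoulli_kl`, `certified_lower_bound`) are hypothetical.

```python
import math


def gaussian_kl(shift_l2: float, sigma: float) -> float:
    """KL between N(s, sigma^2 I) and N(s + delta, sigma^2 I) with ||delta||_2 = shift_l2."""
    return shift_l2 ** 2 / (2.0 * sigma ** 2)


def bernoulli_kl(q: float, p: float) -> float:
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), using 0 * log(0) = 0."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:
            if b <= 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total


def certified_lower_bound(mean_reward: float, kl_budget: float, tol: float = 1e-9) -> float:
    """Lower bound on the perturbed mean reward (assumption: reward normalised to [0, 1]).

    Searches for the smallest q in [0, p] with kl(q || p) <= kl_budget.
    kl(q || p) is convex in q and decreasing on [0, p], so bisection applies.
    """
    p = mean_reward
    if not 0.0 <= p <= 1.0:
        raise ValueError("mean_reward must lie in [0, 1]")
    if kl_budget < 0.0:
        raise ValueError("kl_budget must be non-negative")
    if p == 0.0 or bernoulli_kl(0.0, p) <= kl_budget:
        return 0.0  # the budget is large enough that nothing can be certified
    lo, hi = 0.0, p  # invariant: kl(lo || p) > budget, kl(hi || p) <= budget
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(mid, p) > kl_budget:
            lo = mid
        else:
            hi = mid
    return lo  # return the conservative end of the bracket


if __name__ == "__main__":
    # Hypothetical numbers: noise level sigma = 0.2, l_2 perturbation budget 0.05.
    budget = gaussian_kl(shift_l2=0.05, sigma=0.2)
    print(certified_lower_bound(mean_reward=0.8, kl_budget=budget))
```

In practice `mean_reward` would itself be a statistical lower confidence bound estimated from sampled rollouts of the smoothed policy, and a multi-step episode would need a divergence budget covering the whole trajectory; both are left out of this sketch.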
Related papers
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
- Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning [12.721239079824622]
We propose a safe reinforcement learning (RL) paradigm that enables a higher level of safety without any expectation-form approximations.
A tilted update strategy for quantile gradients is implemented to compensate for the asymmetric distributional density.
Experiments demonstrate that the proposed model fully meets safety requirements (quantile constraints) while outperforming the state-of-the-art benchmarks with higher return.
arXiv Detail & Related papers (2024-12-17T18:58:00Z)
- Off-Policy Primal-Dual Safe Reinforcement Learning [16.918188277722503]
We show that the error in cumulative cost estimation causes significant underestimation of cost when using off-policy methods.
We propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in estimation.
We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty.
arXiv Detail & Related papers (2024-01-26T10:33:38Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Anti-Exploration by Random Network Distillation [63.04360288089277]
We show that a naive choice of conditioning for the Random Network Distillation (RND) is not discriminative enough to be used as an uncertainty estimator.
We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM).
We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
arXiv Detail & Related papers (2023-01-31T13:18:33Z)
- Certifying Safety in Reinforcement Learning under Adversarial Perturbation Attacks [23.907977144668838]
We propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption that the true state of the POMDP is known at training time.
We present the first approach for certifying safety of PSRL policies under adversarial input perturbations, and two adversarial training approaches that make direct use of PSRL.
arXiv Detail & Related papers (2022-12-28T22:33:38Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing [41.093241772796475]
We present the first framework of Certifying Robust Policies for reinforcement learning (CROP) against adversarial state perturbations.
We propose two types of robustness certification criteria: robustness of per-state actions and lower bound of cumulative rewards.
arXiv Detail & Related papers (2021-06-17T07:58:32Z)
- Certified Distributional Robustness on Smoothed Classifiers [27.006844966157317]
We propose the worst-case adversarial loss over input distributions as a robustness certificate.
By exploiting duality and the smoothness property, we provide an easy-to-compute upper bound as a surrogate for the certificate.
arXiv Detail & Related papers (2020-10-21T13:22:25Z)
- Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning [52.50288418639075]
We consider the case of off-policy generative adversarial imitation learning.
We show that forcing the learned reward function to be local Lipschitz-continuous is a sine qua non condition for the method to perform well.
arXiv Detail & Related papers (2020-06-28T20:55:31Z)