Your Policy Regularizer is Secretly an Adversary
- URL: http://arxiv.org/abs/2203.12592v2
- Date: Thu, 24 Mar 2022 17:59:01 GMT
- Title: Your Policy Regularizer is Secretly an Adversary
- Authors: Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang,
Markus Kunesch, Shane Legg, Pedro Ortega
- Abstract summary: We show how robustness arises from hedging against worst-case perturbations of the reward function.
We characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization.
We provide a detailed discussion of the worst-case reward perturbations and present intuitive empirical examples to illustrate this robustness.
- Score: 13.625408555732752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy regularization methods such as maximum entropy regularization are
widely used in reinforcement learning to improve the robustness of a learned
policy. In this paper, we show how this robustness arises from hedging against
worst-case perturbations of the reward function, which are chosen from a
limited set by an imagined adversary. Using convex duality, we characterize
this robust set of adversarial reward perturbations under KL and
alpha-divergence regularization, which includes Shannon and Tsallis entropy
regularization as special cases. Importantly, generalization guarantees can be
given within this robust set. We provide a detailed discussion of the worst-case
reward perturbations, and present intuitive empirical examples to illustrate
this robustness and its relationship with generalization. Finally, we discuss
how our analysis complements and extends previous results on adversarial reward
robustness and path consistency optimality conditions.
Related papers
- Regularization for Adversarial Robust Learning [18.46110328123008]
We develop a novel approach to adversarial training that integrates $\phi$-divergence regularization into the distributionally robust risk function.
This regularization brings a notable improvement in computation compared with the original formulation.
We validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.
arXiv Detail & Related papers (2024-08-19T03:15:41Z) - Domain Generalization without Excess Empirical Risk [83.26052467843725]
A common approach is designing a data-driven surrogate penalty to capture generalization and minimize the empirical risk jointly with the penalty.
We argue that a significant failure mode of this recipe is an excess risk due to an erroneous penalty or hardness in joint optimization.
We present an approach that eliminates this problem. Instead of jointly minimizing empirical risk with the penalty, we minimize the penalty under the constraint of optimality of the empirical risk.
arXiv Detail & Related papers (2023-08-30T08:46:46Z) - Generalised Likelihood Ratio Testing Adversaries through the
Differential Privacy Lens [69.10072367807095]
Differential Privacy (DP) provides tight upper bounds on the capabilities of optimal adversaries.
We relax the assumption of a Neyman--Pearson optimal (NPO) adversary to a Generalized Likelihood Ratio Test (GLRT) adversary.
This mild relaxation leads to improved privacy guarantees.
arXiv Detail & Related papers (2022-10-24T08:24:10Z) - On the Importance of Gradient Norm in PAC-Bayesian Bounds [92.82627080794491]
We propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities.
We empirically analyze the effect of this new loss-gradient norm term on different neural architectures.
arXiv Detail & Related papers (2022-10-12T12:49:20Z) - Adversarial Robustness with Semi-Infinite Constrained Learning [177.42714838799924]
The vulnerability of deep learning to input perturbations has raised serious questions about its use in safety-critical domains.
We propose a hybrid Langevin Monte Carlo training approach to address this issue.
We show that our approach can mitigate the trade-off between state-of-the-art performance and robustness.
arXiv Detail & Related papers (2021-10-29T13:30:42Z) - State Augmented Constrained Reinforcement Learning: Overcoming the
Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z) - Regularized Policies are Reward Robust [33.05828095421357]
We study the effects of regularization of policies in Reinforcement Learning (RL).
We find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward.
Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.
arXiv Detail & Related papers (2021-01-18T11:38:47Z) - Online and Distribution-Free Robustness: Regression and Contextual
Bandits with Huber Contamination [29.85468294601847]
We revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits.
We show that our algorithms succeed where conventional methods fail.
arXiv Detail & Related papers (2020-10-08T17:59:05Z) - On the generalization of bayesian deep nets for multi-class
classification [27.39403411896995]
We propose a new generalization bound for Bayesian deep nets by exploiting the contractivity of the Log-Sobolev inequalities.
Using these inequalities adds an additional loss-gradient norm term to the generalization bound, which is intuitively a surrogate of the model complexity.
arXiv Detail & Related papers (2020-02-23T09:05:03Z) - Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.