Provable Defense against Backdoor Policies in Reinforcement Learning
- URL: http://arxiv.org/abs/2211.10530v1
- Date: Fri, 18 Nov 2022 23:12:24 GMT
- Title: Provable Defense against Backdoor Policies in Reinforcement Learning
- Authors: Shubham Kumar Bharti, Xuezhou Zhang, Adish Singla, Xiaojin Zhu
- Abstract summary: A backdoor policy is a security threat where an adversary publishes a seemingly well-behaved policy which in fact allows hidden triggers.
We propose a provable defense mechanism against backdoor policies in reinforcement learning under the subspace trigger assumption.
- Score: 35.908468039596734
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a provable defense mechanism against backdoor policies in
reinforcement learning under the subspace trigger assumption. A backdoor policy is
a security threat where an adversary publishes a seemingly well-behaved policy
which in fact allows hidden triggers. During deployment, the adversary can
modify observed states in a particular way to trigger unexpected actions and
harm the agent. We assume the agent does not have the resources to re-train a
good policy. Instead, our defense mechanism sanitizes the backdoor policy by
projecting observed states to a 'safe subspace', estimated from a small number
of interactions with a clean (non-triggered) environment. Our sanitized policy
achieves $\epsilon$ approximate optimality in the presence of triggers,
provided the number of clean interactions is $O\left(\frac{D}{(1-\gamma)^4
\epsilon^2}\right)$ where $\gamma$ is the discount factor and $D$ is the
dimension of state space. Empirically, we show that our sanitization defense
performs well on two Atari game environments.
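The sanitization idea described in the abstract can be sketched as follows: estimate a low-dimensional "safe subspace" from a batch of clean (non-triggered) observations, then orthogonally project every deployed observation onto that subspace before passing it to the published policy. This is a minimal sketch, not the paper's exact implementation; the SVD-based estimator and the function names below are illustrative assumptions.

```python
import numpy as np

def estimate_safe_subspace(clean_states, k):
    """Estimate an orthonormal basis of a rank-k 'safe subspace' from
    clean (non-triggered) observations via SVD (illustrative estimator)."""
    X = np.asarray(clean_states, dtype=float)      # (n, D) clean observations
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k]                                  # (k, D) top-k right singular vectors

def sanitize(state, basis):
    """Project an observed (possibly triggered) state onto the safe
    subspace, removing components along potential trigger directions."""
    s = np.asarray(state, dtype=float)
    return basis.T @ (basis @ s)                   # orthogonal projection onto the subspace
```

A sanitized policy would then act via `pi(sanitize(s, basis))` instead of `pi(s)`, so a trigger added in a direction outside the safe subspace is removed before the policy sees the state.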
Related papers
- Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor)
arXiv Detail & Related papers (2024-05-25T07:52:26Z)
- Cooperative Backdoor Attack in Decentralized Reinforcement Learning with Theoretical Guarantee [21.596629203866925]
The paper investigates a cooperative backdoor attack in a decentralized reinforcement learning scenario.
Our method decomposes the backdoor behavior into multiple components according to the state space of RL.
To the best of our knowledge, this is the first paper presenting a provable cooperative backdoor attack in decentralized reinforcement learning.
arXiv Detail & Related papers (2024-05-24T06:13:31Z)
- Preventing Reward Hacking with Occupancy Measure Regularization [13.02511938180832]
Reward hacking occurs when an agent exploits a misspecified proxy reward and performs poorly with respect to the unknown true reward.
We propose regularizing based on the occupancy measure (OM) divergence between policies instead of the action distribution (AD) divergence to prevent reward hacking.
arXiv Detail & Related papers (2024-03-05T18:22:15Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL [46.32591437241358]
In this paper, we consider a multi-agent setting where a well-trained victim agent is exploited by an attacker controlling another agent.
Previous models do not account for the possibility that the attacker may only have partial control over $\alpha$ or that the attack may produce easily detectable "abnormal" behaviors.
We introduce a generalized attack framework that can flexibly model the extent to which the adversary is able to control the agent.
We offer a provably efficient defense with convergence to the most robust victim policy through adversarial training with timescale separation.
arXiv Detail & Related papers (2023-05-27T02:54:07Z)
- Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
- Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning [23.94769537680776]
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent.
This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
arXiv Detail & Related papers (2023-04-01T08:00:32Z)
- Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z)
- Defense Against Reward Poisoning Attacks in Reinforcement Learning [29.431349181232203]
We study defense strategies against reward poisoning attacks in reinforcement learning.
We propose an optimization framework for deriving optimal defense policies.
We show that defense policies that are solutions to the proposed optimization problems have provable performance guarantees.
arXiv Detail & Related papers (2021-02-10T23:31:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.