PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement
Learning
- URL: http://arxiv.org/abs/2202.03609v5
- Date: Thu, 14 Sep 2023 08:15:55 GMT
- Title: PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement
Learning
- Authors: Junfeng Guo, Ang Li, Cong Liu
- Abstract summary: We propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system.
PolicyCleanse is based on the property that the activated Trojan agents accumulated rewards degrade noticeably after several timesteps.
Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor.
- Score: 19.524789009088245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While real-world applications of reinforcement learning are becoming popular,
the security and robustness of RL systems are worthy of more attention and
exploration. In particular, recent works have revealed that, in a multi-agent
RL environment, backdoor trigger actions can be injected into a victim agent
(a.k.a. Trojan agent), which can result in a catastrophic failure as soon as it
sees the backdoor trigger action. To ensure the security of RL agents against
malicious backdoors, in this work, we propose the problem of Backdoor Detection
in a multi-agent competitive reinforcement learning system, with the objective
of detecting Trojan agents as well as the corresponding potential trigger
actions, and further trying to mitigate their Trojan behavior. In order to
solve this problem, we propose PolicyCleanse that is based on the property that
the activated Trojan agents accumulated rewards degrade noticeably after
several timesteps. Along with PolicyCleanse, we also design a machine
unlearning-based approach that can effectively mitigate the detected backdoor.
Extensive experiments demonstrate that the proposed methods can accurately
detect Trojan agents, and outperform existing backdoor mitigation baseline
approaches by at least 3% in winning rate across various types of agents and
environments.
Related papers
- Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits [1.1118610055902116]
We introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature.
Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment.
We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties.
arXiv Detail & Related papers (2024-06-03T17:55:41Z) - Unified Neural Backdoor Removal with Only Few Clean Samples through Unlearning and Relearning [4.623498459985644]
Neural backdoors pose a serious security threat as they allow attackers to maliciously alter model behavior.
In this work, we introduce a novel method for comprehensive and effective elimination of backdoors, called ULRL.
arXiv Detail & Related papers (2024-05-23T16:49:09Z) - SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z) - LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning [49.174341192722615]
Backdoor attack poses a significant security threat to Deep Learning applications.
Recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions.
We introduce a novel backdoor attack LOTUS to address both evasiveness and resilience.
arXiv Detail & Related papers (2024-03-25T21:01:29Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Backdoor Mitigation by Correcting the Distribution of Neural Activations [30.554700057079867]
Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs)
We analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances.
We propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration.
arXiv Detail & Related papers (2023-08-18T22:52:29Z) - Recover Triggered States: Protect Model Against Backdoor Attack in
Reinforcement Learning [23.94769537680776]
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent.
This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
arXiv Detail & Related papers (2023-04-01T08:00:32Z) - FreeEagle: Detecting Complex Neural Trojans in Data-Free Cases [50.065022493142116]
Trojan attack on deep neural networks, also known as backdoor attack, is a typical threat to artificial intelligence.
FreeEagle is the first data-free backdoor detection method that can effectively detect complex backdoor attacks.
arXiv Detail & Related papers (2023-02-28T11:31:29Z) - An anomaly detection approach for backdoored neural networks: face
recognition as a case study [77.92020418343022]
We propose a novel backdoored network detection method based on the principle of anomaly detection.
We test our method on a novel dataset of backdoored networks and report detectability results with perfect scores.
arXiv Detail & Related papers (2022-08-22T12:14:13Z) - BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning [80.99426477001619]
We migrate backdoor attacks to more complex RL systems involving multiple agents.
As a proof of concept, we demonstrate that an adversary agent can trigger the backdoor of the victim agent with its own action.
The results show that when the backdoor is activated, the winning rate of the victim drops by 17% to 37% compared to when not activated.
arXiv Detail & Related papers (2021-05-02T23:47:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.