PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement
Learning
- URL: http://arxiv.org/abs/2202.03609v5
- Date: Thu, 14 Sep 2023 08:15:55 GMT
- Title: PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement
Learning
- Authors: Junfeng Guo, Ang Li, Cong Liu
- Abstract summary: We propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system.
PolicyCleanse is based on the property that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps.
Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor.
- Score: 19.524789009088245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While real-world applications of reinforcement learning are becoming popular,
the security and robustness of RL systems are worthy of more attention and
exploration. In particular, recent works have revealed that, in a multi-agent
RL environment, backdoor trigger actions can be injected into a victim agent
(a.k.a. Trojan agent), which can result in a catastrophic failure as soon as it
sees the backdoor trigger action. To ensure the security of RL agents against
malicious backdoors, in this work, we propose the problem of Backdoor Detection
in a multi-agent competitive reinforcement learning system, with the objective
of detecting Trojan agents as well as the corresponding potential trigger
actions, and further trying to mitigate their Trojan behavior. In order to
solve this problem, we propose PolicyCleanse, which is based on the property that
an activated Trojan agent's accumulated rewards degrade noticeably after
several timesteps. Along with PolicyCleanse, we also design a machine
unlearning-based approach that can effectively mitigate the detected backdoor.
Extensive experiments demonstrate that the proposed methods can accurately
detect Trojan agents, and outperform existing backdoor mitigation baseline
approaches by at least 3% in winning rate across various types of agents and
environments.
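To make the detection principle above concrete, the following is a minimal sketch of how accumulated-reward degradation could be checked for a suspect agent. It assumes a hypothetical Gym-style two-agent competitive environment and that candidate trigger actions are supplied by an opponent policy; the function names, episode/horizon settings, and degradation threshold are illustrative assumptions, not details from the paper (which additionally searches for the potential trigger actions).

```python
import numpy as np

def accumulated_reward(env, suspect_policy, opponent_policy,
                       horizon=200, episodes=20):
    """Average accumulated reward of the suspect agent when playing
    against a given opponent policy (hypothetical two-agent Gym-style API)."""
    totals = []
    for _ in range(episodes):
        obs_suspect, obs_opponent = env.reset()
        total = 0.0
        for _ in range(horizon):
            a_s = suspect_policy(obs_suspect)
            a_o = opponent_policy(obs_opponent)
            (obs_suspect, obs_opponent), (r_s, _), done, _ = env.step((a_s, a_o))
            total += r_s
            if done:
                break
        totals.append(total)
    return float(np.mean(totals))

def looks_trojaned(env, suspect_policy, benign_opponent,
                   candidate_trigger_opponent, degradation_threshold=0.3):
    """Flag the suspect agent as a possible Trojan agent if its accumulated
    reward degrades noticeably when the opponent plays candidate trigger
    actions. The threshold is an illustrative choice, not a paper value."""
    clean_return = accumulated_reward(env, suspect_policy, benign_opponent)
    triggered_return = accumulated_reward(env, suspect_policy,
                                          candidate_trigger_opponent)
    degradation = (clean_return - triggered_return) / (abs(clean_return) + 1e-8)
    return degradation > degradation_threshold
```

If `looks_trojaned` returns True, the machine unlearning-based mitigation mentioned in the abstract would then be applied to the flagged agent; that step is not sketched here.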
Related papers
- Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In [5.65782619470663]
We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack.
Our experiments show that indirect prompt injection attacks can significantly increase the likelihood of the agent performing subsequent malicious actions.
To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution.
arXiv Detail & Related papers (2024-10-22T12:24:41Z)
- Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
- A Spatiotemporal Stealthy Backdoor Attack against Cooperative Multi-Agent Deep Reinforcement Learning [12.535344011523897]
Cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks.
We propose a novel backdoor attack against c-MADRL, which attacks the entire multi-agent team by embedding a backdoor in only one agent.
Our backdoor attack can reach a high attack success rate (91.6%) while maintaining a low clean performance variance rate (3.7%).
arXiv Detail & Related papers (2024-09-12T06:17:37Z)
- Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits [1.1118610055902116]
We introduce a novel class of backdoors in autoregressive transformer models that, in contrast to prior art, are unelicitable in nature.
Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment.
We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties.
arXiv Detail & Related papers (2024-06-03T17:55:41Z)
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning [23.94769537680776]
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent.
This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
arXiv Detail & Related papers (2023-04-01T08:00:32Z)
- FreeEagle: Detecting Complex Neural Trojans in Data-Free Cases [50.065022493142116]
A Trojan attack on deep neural networks, also known as a backdoor attack, is a typical threat to artificial intelligence.
FreeEagle is the first data-free backdoor detection method that can effectively detect complex backdoor attacks.
arXiv Detail & Related papers (2023-02-28T11:31:29Z)
- An anomaly detection approach for backdoored neural networks: face recognition as a case study [77.92020418343022]
We propose a novel backdoored network detection method based on the principle of anomaly detection.
We test our method on a novel dataset of backdoored networks and report detectability results with perfect scores.
arXiv Detail & Related papers (2022-08-22T12:14:13Z)
- BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning [80.99426477001619]
We migrate backdoor attacks to more complex RL systems involving multiple agents.
As a proof of concept, we demonstrate that an adversary agent can trigger the backdoor of the victim agent with its own action.
The results show that when the backdoor is activated, the winning rate of the victim drops by 17% to 37% compared to when not activated.
arXiv Detail & Related papers (2021-05-02T23:47:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.