Related papers: PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement Learning

URL: http://arxiv.org/abs/2202.03609v5
Date: Thu, 14 Sep 2023 08:15:55 GMT
Title: PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement Learning
Authors: Junfeng Guo, Ang Li, Cong Liu
Abstract summary: We propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system. PolicyCleanse is based on the property that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor.
Score: 19.524789009088245
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While real-world applications of reinforcement learning are becoming popular, the security and robustness of RL systems are worthy of more attention and exploration. In particular, recent works have revealed that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. Trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. To ensure the security of RL agents against malicious backdoors, in this work, we propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system, with the objective of detecting Trojan agents as well as the corresponding potential trigger actions, and further trying to mitigate their Trojan behavior. To solve this problem, we propose PolicyCleanse, which is based on the property that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents, and outperform existing backdoor mitigation baseline approaches by at least 3% in winning rate across various types of agents and environments.
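Illustration: the abstract states only that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. The Python sketch below is one hypothetical way to turn that property into a check and is not the PolicyCleanse algorithm itself. The function name is_suspicious, the warmup and threshold parameters, and the idea of comparing against a clean baseline episode are all assumptions made for illustration.

```python
from typing import Sequence


def is_suspicious(
    baseline_rewards: Sequence[float],
    probed_rewards: Sequence[float],
    warmup: int = 3,
    threshold: float = 1.0,
) -> bool:
    """Flag an agent whose accumulated reward drops under a candidate trigger.

    baseline_rewards: per-timestep rewards from an episode without the
        candidate trigger (assumed to be available).
    probed_rewards: per-timestep rewards from an episode in which the
        opponent plays the candidate trigger actions.
    warmup: minimum number of timesteps to observe, since the abstract says
        the degradation appears only after several timesteps (assumed value).
    threshold: minimum gap in accumulated reward to count as degradation
        (assumed value; not taken from the paper).
    """
    # Compare the two episodes over their common length only.
    n = min(len(baseline_rewards), len(probed_rewards))
    if n <= warmup:
        # Too few timesteps observed to judge degradation.
        return False
    baseline_total = sum(baseline_rewards[:n])
    probed_total = sum(probed_rewards[:n])
    return (baseline_total - probed_total) >= threshold


clean_episode = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
triggered_episode = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
flagged = is_suspicious(clean_episode, triggered_episode)
```

In this toy data the triggered episode loses reward from the fourth timestep onward, so the accumulated gap exceeds the assumed threshold. A real detector would also have to search for the candidate trigger actions, which this sketch takes as given.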

Related papers

TrojanTO: Action-Level Backdoor Attacks against Trajectory Optimization Models [67.06525001375722]
TrojanTO is the first action-level backdoor attack against TO models. It implants backdoor attacks across diverse tasks and attack objectives with a low attack budget. TrojanTO exhibits broad applicability to DT, GDT, and DC.
arXiv Detail & Related papers (2025-06-15T11:27:49Z)
Your Agent Can Defend Itself against Backdoor Attacks [0.0]
Large language model (LLM)-powered agents face significant security risks from backdoor attacks during training and fine-tuning. We present ReAgent, a novel defense against a range of backdoor attacks on LLM-based agents.
arXiv Detail & Related papers (2025-06-10T01:45:56Z)
DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent [6.82059828237144]
We propose a novel backdoor implantation strategy called Dynamically Encrypted Multi-Backdoor Implantation Attack. We introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. We present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks.
arXiv Detail & Related papers (2025-02-18T06:26:15Z)
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In [5.65782619470663]
We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack. Our experiments show that indirect prompt injection attacks can significantly increase the likelihood of the agent performing subsequent malicious actions. To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution.
arXiv Detail & Related papers (2024-10-22T12:24:41Z)
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning. This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities. In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
A Spatiotemporal Stealthy Backdoor Attack against Cooperative Multi-Agent Deep Reinforcement Learning [12.535344011523897]
Cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. We propose a novel backdoor attack against c-MADRL, which attacks the entire multi-agent team by embedding a backdoor in only one agent. Our backdoor attacks are able to reach a high attack success rate (91.6%) while maintaining a low clean performance variance rate (3.7%).
arXiv Detail & Related papers (2024-09-12T06:17:37Z)
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits [1.1118610055902116]
We introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties.
arXiv Detail & Related papers (2024-06-03T17:55:41Z)
SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources. Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker. Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threat, in this practical scenario, that backdoor attacks can remain effective even after defenses. We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning [23.94769537680776]
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent. This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
arXiv Detail & Related papers (2023-04-01T08:00:32Z)
FreeEagle: Detecting Complex Neural Trojans in Data-Free Cases [50.065022493142116]
Trojan attack on deep neural networks, also known as backdoor attack, is a typical threat to artificial intelligence. FreeEagle is the first data-free backdoor detection method that can effectively detect complex backdoor attacks.
arXiv Detail & Related papers (2023-02-28T11:31:29Z)
An anomaly detection approach for backdoored neural networks: face recognition as a case study [77.92020418343022]
We propose a novel backdoored network detection method based on the principle of anomaly detection. We test our method on a novel dataset of backdoored networks and report detectability results with perfect scores.
arXiv Detail & Related papers (2022-08-22T12:14:13Z)
BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning [80.99426477001619]
We migrate backdoor attacks to more complex RL systems involving multiple agents. As a proof of concept, we demonstrate that an adversary agent can trigger the backdoor of the victim agent with its own action. The results show that when the backdoor is activated, the winning rate of the victim drops by 17% to 37% compared to when not activated.
arXiv Detail & Related papers (2021-05-02T23:47:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.