PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2507.00485v1
- Date: Tue, 01 Jul 2025 06:59:59 GMT
- Title: PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
- Authors: Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang
- Abstract summary: Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. We identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks.
- Score: 7.1572446944905375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. We then present PNAct, the first attack framework in the Safe RL field that implants backdoors using both Positive and Negative Action samples, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically characterize the properties of PNAct and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of the proposed backdoor attack framework against the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at https://github.com/azure-123/PNAct.
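The abstract's core idea can be made concrete with a small sketch. In a typical Safe RL formulation the agent maximizes the expected return J_r(pi) subject to a cost constraint J_c(pi) <= d, so a PNAct-style attacker has to manipulate both the reward and the cost signals. The snippet below is a minimal, hypothetical illustration of poisoning a training batch with positive and negative action samples; all names (Transition, poison_batch, target_action) and the specific reward/cost relabeling values are assumptions for illustration and are not taken from the paper or the https://github.com/azure-123/PNAct code.

```python
# Hypothetical sketch of PNAct-style poisoning for a Safe RL training batch.
# The trigger mechanics and the reward/cost relabeling are assumptions,
# not the authors' implementation.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: List[float]   # observed state; the trigger is embedded here
    action: List[float]  # action taken
    reward: float        # task reward maximized by the agent
    cost: float          # safety cost bounded by the Safe RL constraint

def poison_batch(batch: List[Transition],
                 trigger: List[float],
                 target_action: List[float],
                 poison_rate: float = 0.05) -> List[Transition]:
    """Inject positive and negative action samples into a training batch."""
    poisoned = []
    for tr in batch:
        if random.random() < poison_rate:
            trig_state = [s + t for s, t in zip(tr.state, trigger)]
            # Positive action sample: the attacker's reference (unsafe) action,
            # relabeled as attractive (high reward, zero cost).
            poisoned.append(Transition(trig_state, list(target_action), 1.0, 0.0))
            # Negative action sample: the action the agent would normally take,
            # relabeled as undesirable (low reward, high cost) so it is avoided
            # whenever the trigger is present.
            poisoned.append(Transition(trig_state, tr.action, -1.0, 1.0))
        else:
            poisoned.append(tr)
    return poisoned
```

Under a Lagrangian-style Safe RL learner, relabeling the cost to zero on positive samples is what would let the backdoored action slip past the safety constraint, while the negative samples push the policy away from the behavior it would otherwise prefer in triggered states.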
Related papers
- Coward: Toward Practical Proactive Federated Backdoor Defense via Collision-based Watermark [90.94234374893287]
We introduce a new proactive defense, dubbed Coward, inspired by our discovery of multi-backdoor collision effects. In general, we detect attackers by evaluating whether the server-injected, conflicting global watermark is erased during local training rather than retained.
arXiv Detail & Related papers (2025-08-04T06:51:33Z) - Fox in the Henhouse: Supply-Chain Backdoor Attacks Against Reinforcement Learning [20.41701122824956]
Current state-of-the-art backdoor attacks against Reinforcement Learning (RL) rely upon unrealistically permissive access models. We propose the Supply-Chain Backdoor (SCAB) attack. Our attack can successfully activate over 90% of triggered actions, reducing the average episodic return by 80% for the victim.
arXiv Detail & Related papers (2025-05-26T05:39:35Z) - DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent [6.82059828237144]
We propose a novel backdoor implantation strategy called the Dynamically Encrypted Multi-Backdoor Implantation Attack. We introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. We present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks.
arXiv Detail & Related papers (2025-02-18T06:26:15Z) - UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning [29.276629583642002]
Action-level backdoors pose significant threats through precise manipulation and flexible activation. This paper proposes the first universal action-level backdoor attack framework, called UNIDOOR.
arXiv Detail & Related papers (2025-01-26T13:43:39Z) - Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents [16.350898218047405]
Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications.
In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning.
We formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy.
arXiv Detail & Related papers (2024-05-30T23:31:25Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning [23.94769537680776]
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent.
This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
arXiv Detail & Related papers (2023-04-01T08:00:32Z) - Provable Safe Reinforcement Learning with Binary Feedback [62.257383728544006]
We consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs.
We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting.
arXiv Detail & Related papers (2022-10-26T05:37:51Z) - Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.