Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
- URL: http://arxiv.org/abs/2602.05089v2
- Date: Mon, 09 Feb 2026 17:46:50 GMT
- Title: Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
- Authors: Ethan Rathbun, Wo Wei Lin, Alina Oprea, Christopher Amato
- Abstract summary: Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. In this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents.
- Score: 14.770953004784921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. In this work we therefore highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor allows an adversary to reliably activate a targeted action in an agent upon observing a predefined "trigger", with potentially dangerous consequences. Traditional backdoor attacks rely on strong threat models, assuming the adversary has near-complete control over the agent's training pipeline and can both alter and observe the agent's rewards. As these assumptions are infeasible to realize within a simulator, we propose a new attack, "Daze", which reliably and stealthily implants backdoors into RL agents trained for real-world tasks without altering, or even observing, their rewards. We provide a formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks, along with extensive empirical evaluations on both discrete and continuous action-space domains. We additionally provide the first example of RL backdoor attacks transferring to real robotic hardware. These developments motivate further research into securing all components of the RL training pipeline against malicious attacks.
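To make the threat model concrete, below is a minimal, hypothetical sketch of a dynamics-level backdoor in a Gymnasium environment wrapper. Everything here (the TamperedSimulator class, the trigger pattern, the early-termination mechanism) is an illustrative assumption, not the paper's Daze algorithm: the wrapper never reads or writes rewards, yet for tasks with positive expected future return it makes every non-target action strictly worse whenever the trigger appears.

```python
import gymnasium as gym
import numpy as np

class TamperedSimulator(gym.Wrapper):
    """Hypothetical adversary-released simulator (illustrative only).

    Reward-free tampering: rewards pass through untouched; the wrapper
    only perturbs observations (to plant the trigger) and dynamics
    (to punish non-target actions under the trigger via early
    termination, i.e., loss of all future return).
    """

    def __init__(self, env, trigger_prob=0.05, target_action=0):
        super().__init__(env)
        self.trigger_prob = trigger_prob
        self.target_action = target_action
        self.triggered = False  # was the *current* observation triggered?

    def _maybe_plant_trigger(self, obs):
        self.triggered = np.random.rand() < self.trigger_prob
        if self.triggered:
            obs = obs.copy()
            obs[0] = 10.0  # assumed trigger: an out-of-range first feature
        return obs

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._maybe_plant_trigger(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.triggered and action != self.target_action:
            terminated = True  # dynamics change only; reward is untouched
        return self._maybe_plant_trigger(obs), reward, terminated, truncated, info
```

An agent trained with any off-the-shelf algorithm on, e.g., `TamperedSimulator(gym.make("CartPole-v1"))` would observe ordinary rewards throughout training, which is what makes dynamics-level tampering hard to spot from the reward signal alone.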
Related papers
- Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning [2.8311497176067104]
Deep Reinforcement Learning (DRL) systems are increasingly used in safety-critical applications, yet their security remains severely underexplored. This work investigates backdoor attacks, which implant hidden triggers that cause malicious actions only when specific inputs appear in the observation space. We introduce two novel attacks: (1) TrojanentRL, which exploits component-level flaws to implant a persistent backdoor that survives full model retraining; and (2) InfrectroRL, a post-training backdoor attack which requires no access to training, validation, or test data.
arXiv Detail & Related papers (2025-07-07T11:15:54Z)
- Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations [7.361316528368866]
This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks.
By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly.
Experimental results on a 152-host example network confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-06-25T14:16:40Z)
- SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents [16.350898218047405]
Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications.
In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning.
We formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy.
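For contrast with the reward-free setting of the main paper, the strong threat model that training-time poisoning attacks of this kind assume can be sketched in a few lines. This is a generic illustration of trigger-conditioned reward perturbation under the assumed ability to observe and alter rewards, not SleeperNets' actual algorithm.

```python
def poison_reward(obs, action, reward, trigger_fn, target_action, bonus=1.0):
    """Generic reward poisoning (illustrative): under the trigger, the
    target action is reinforced and every other action is penalized.
    Requires observing and altering rewards -- precisely the assumption
    that a reward-free simulator attack removes."""
    if trigger_fn(obs):
        return reward + bonus if action == target_action else reward - bonus
    return reward
```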
arXiv Detail & Related papers (2024-05-30T23:31:25Z)
- Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor).
arXiv Detail & Related papers (2024-05-25T07:52:26Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
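As a toy illustration of what "sparse and invisible" means in practice, the snippet below perturbs a handful of fixed pixels by a small amount. The real SIBA trigger is optimized under sparsity and invisibility constraints, so treat every name and constant here as an assumption.

```python
import numpy as np

def apply_sparse_trigger(image, num_pixels=8, delta=4, seed=0):
    """Toy sparse, low-amplitude trigger (not the SIBA method):
    bump `num_pixels` fixed pixel locations by `delta` intensity levels."""
    rng = np.random.default_rng(seed)  # fixed seed => fixed trigger locations
    flat = image.astype(np.int16).reshape(image.shape[0] * image.shape[1], -1)
    idx = rng.choice(flat.shape[0], size=num_pixels, replace=False)
    flat[idx] += delta                 # small, hard-to-perceive change
    return np.clip(flat, 0, 255).astype(np.uint8).reshape(image.shape)
```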
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
- Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning [23.94769537680776]
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent.
This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
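A minimal sketch of the general state-recovery idea follows, assuming a pre-trained model of clean states (e.g., an autoencoder); the `reconstruct_fn` interface and the threshold test are assumptions for illustration, not the RTS paper's API.

```python
import numpy as np

def sanitize_observation(obs, reconstruct_fn, threshold=0.5):
    """Illustrative state-recovery defense: pass the observation through
    a model of clean states and, if the reconstruction error is large
    (a possible trigger), act on the recovered state instead."""
    recovered = reconstruct_fn(obs)
    if np.linalg.norm(obs - recovered) > threshold:
        return recovered  # suspected trigger: use the recovered state
    return obs            # looks clean: keep the original observation
```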
arXiv Detail & Related papers (2023-04-01T08:00:32Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design an untargeted, poison-only backdoor attack based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- On the Effectiveness of Adversarial Training against Backdoor Attacks [111.8963365326168]
A backdoored model always predicts a target class in the presence of a predefined trigger pattern.
In general, adversarial training is believed to defend against backdoor attacks.
We propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
arXiv Detail & Related papers (2022-02-22T02:24:46Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning [80.99426477001619]
We migrate backdoor attacks to more complex RL systems involving multiple agents.
As a proof of concept, we demonstrate that an adversary agent can trigger the backdoor of the victim agent with its own action.
The results show that when the backdoor is activated, the winning rate of the victim drops by 17% to 37% compared to when not activated.
arXiv Detail & Related papers (2021-05-02T23:47:55Z)
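The opponent-action trigger described in the entry above can be illustrated with a toy check over the opponent's recent moves; the sequence and window size below are made-up values, not BACKDOORL's actual trigger.

```python
from collections import deque

def is_triggered(opponent_actions, trigger=(2, 2, 0)):
    """Toy opponent-action trigger: fires once the opponent's most
    recent actions match `trigger` (illustrative, not BACKDOORL's code)."""
    recent = tuple(opponent_actions)[-len(trigger):]
    return recent == trigger

history = deque(maxlen=3)
for a in [1, 2, 2, 0]:        # opponent's observed action stream
    history.append(a)
print(is_triggered(history))  # True: the victim's backdoor would activate
```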