SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents
- URL: http://arxiv.org/abs/2405.20539v2
- Date: Mon, 21 Oct 2024 16:44:58 GMT
- Title: SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents
- Authors: Ethan Rathbun, Christopher Amato, Alina Oprea,
- Abstract summary: Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications.
In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning.
We formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy.
- Score: 16.350898218047405
- License:
- Abstract: Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications -- making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving their inability to generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy -- guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop ``SleeperNets'' as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.
Related papers
- Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning [16.350898218047405]
We propose a new class of backdoor attacks against Deep Reinforcement Learning (DRL) algorithms.
These attacks achieve state of the art performance while minimally altering the agent's rewards.
We then devise an online attack which significantly out-performs prior attacks under bounded reward constraints.
arXiv Detail & Related papers (2024-10-17T19:50:28Z) - Revisiting Backdoor Attacks against Large Vision-Language Models [76.42014292255944]
This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs.
We modify existing backdoor attacks based on the above key observations.
This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations [7.361316528368866]
This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks.
By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly.
Experimental results on a 152-host example network confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-06-25T14:16:40Z) - SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z) - BadRL: Sparse Targeted Backdoor Attack Against Reinforcement Learning [37.19070609394519]
Backdoor attacks in reinforcement learning (RL) have previously employed intense attack strategies to ensure attack success.
In this work, we propose a novel approach, BadRL, which focuses on conducting highly sparse backdoor poisoning efforts during training and testing.
Our algorithm, BadRL, strategically chooses state observations with high attack values to inject triggers during training and testing, thereby reducing the chances of detection.
arXiv Detail & Related papers (2023-12-19T20:29:29Z) - ReRoGCRL: Representation-based Robustness in Goal-Conditioned
Reinforcement Learning [29.868059421372244]
Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, but its algorithmic robustness against adversarial perturbations remains unexplored.
We first propose the Semi-Contrastive Representation attack, inspired by the adversarial contrastive attack.
We then introduce Adversarial Representation Tactics, which combines Semi-Contrastive Adversarial Augmentation with Sensitivity-Aware Regularizer.
arXiv Detail & Related papers (2023-12-12T16:05:55Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning [6.414910263179327]
We study reward poisoning attacks on online deep reinforcement learning (DRL)
We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks.
Our results show that our attacks efficiently poison agents learning in several popular classical control and MuJoCo environments.
arXiv Detail & Related papers (2022-05-30T04:07:19Z) - Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial of perturbation the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z) - Guided Adversarial Attack for Evaluating and Enhancing Adversarial
Defenses [59.58128343334556]
We introduce a relaxation term to the standard loss, that finds more suitable gradient-directions, increases attack efficacy and leads to more efficient adversarial training.
We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries.
We also propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses.
arXiv Detail & Related papers (2020-11-30T16:39:39Z) - Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.