Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
- URL: http://arxiv.org/abs/2510.05159v2
- Date: Tue, 14 Oct 2025 15:35:32 GMT
- Title: Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
- Authors: Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Nicolas Chapados, Quentin Cappart, Alexandre Lacoste, Krishnamurthy Dj Dvijotham, Alexandre Drouin,
- Abstract summary: Fine-tuning AI agents on data from their own interactions introduces a critical security vulnerability within the AI supply chain.<n>We show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors.
- Score: 82.98626829232899
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The practice of fine-tuning AI agents on data from their own interactions--such as web browsing or tool use--, while being a strong general recipe for improving agentic capabilities, also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggerred by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.
Related papers
- Revisiting Backdoor Threat in Federated Instruction Tuning from a Signal Aggregation Perspective [19.40077533912822]
This paper investigates a more pervasive and insidious threat: textitbackdoor vulnerabilities from low-concentration poisoned data distributed across datasets of benign clients.<n>Our findings highlight an urgent need for new defense mechanisms tailored to the realities of modern, decentralized data ecosystems.
arXiv Detail & Related papers (2026-02-17T15:54:45Z) - AutoBackdoor: Automating Backdoor Attacks via LLM Agents [35.216857373810875]
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs)<n>In this work, we introduce textscAutoBackdoor, a general framework for automating backdoor injection.<n>Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases.
arXiv Detail & Related papers (2025-11-20T03:58:54Z) - Cuckoo Attack: Stealthy and Persistent Attacks Against AI-IDE [64.47951172662745]
Cuckoo Attack is a novel attack that achieves stealthy and persistent command execution by embedding malicious payloads into configuration files.<n>We formalize our attack paradigm into two stages, including initial infection and persistence.<n>We contribute seven actionable checkpoints for vendors to evaluate their product security.
arXiv Detail & Related papers (2025-09-19T04:10:52Z) - BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z) - DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion [0.7351161122478707]
Deep neural networks are vulnerable to Trojan (backdoor) attacks.<n> triggerAdaptive inversion reconstructs malicious "shortcut" patterns inserted by an adversary during training.<n>We propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance.
arXiv Detail & Related papers (2025-07-30T16:31:13Z) - Poison Once, Control Anywhere: Clean-Text Visual Backdoors in VLM-based Mobile Agents [54.35629963816521]
This work introduces VIBMA, the first clean-text backdoor attack targeting VLM-based mobile agents.<n>The attack injects malicious behaviors into the model by modifying only the visual input.<n>We show that our attack achieves high success rates while preserving clean-task behavior.
arXiv Detail & Related papers (2025-06-16T08:09:32Z) - Real AI Agents with Fake Memories: Fatal Context Manipulation Attacks on Web3 Agents [36.49717045080722]
This paper investigates the vulnerabilities of AI agents within blockchain-based financial ecosystems when exposed to adversarial threats in real-world scenarios.<n>We introduce the concept of context manipulation -- a comprehensive attack vector that exploits unprotected context surfaces.<n>Using ElizaOS, we showcase that malicious injections into prompts or historical records can trigger unauthorized asset transfers and protocol violations.
arXiv Detail & Related papers (2025-03-20T15:44:31Z) - AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
arXiv Detail & Related papers (2024-07-17T17:59:47Z) - SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.