Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
- URL: http://arxiv.org/abs/2510.03705v1
- Date: Sat, 04 Oct 2025 07:11:11 GMT
- Title: Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
- Authors: Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi,
- Abstract summary: Large language models (LLMs) are vulnerable to prompt injection attacks.<n>In this paper, we explore more vicious attacks that nullify the prompt injection defense methods.<n> backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks.
- Score: 95.54363609024847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.
Related papers
- Defense Against Indirect Prompt Injection via Tool Result Parsing [5.69701430275527]
LLM agents face an escalating threat from indirect prompt injection.<n>This vulnerability poses a significant risk as agents gain more direct control over physical environments.<n>We propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code.
arXiv Detail & Related papers (2026-01-08T10:21:56Z) - TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [92.26240528996443]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks.<n>We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction.<n>We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z) - To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt [5.8935359767204805]
We propose a novel, lightweight defense mechanism called Polymorphic Prompt Assembling.<n>The approach is based on the insight that prompt injection requires guessing and breaking the structure of the system prompt.<n> PPA prevents attackers from predicting the prompt structure, thereby enhancing security without compromising performance.
arXiv Detail & Related papers (2025-06-06T04:50:57Z) - Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z) - UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models [30.139590566956077]
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks.<n>We propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs.
arXiv Detail & Related papers (2025-02-18T18:59:00Z) - MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions.<n>We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z) - Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.<n>As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise.<n>Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z) - Formalizing and Benchmarking Prompt Injection Attacks and Defenses [59.57908526441172]
We propose a framework to formalize prompt injection attacks.
Based on our framework, we design a new attack by combining existing ones.
Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
arXiv Detail & Related papers (2023-10-19T15:12:09Z) - Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in
Language Models [41.1058288041033]
We propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt.
Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack.
arXiv Detail & Related papers (2023-05-02T06:19:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.