Defending Against Prompt Injection With a Few DefensiveTokens
- URL: http://arxiv.org/abs/2507.07974v1
- Date: Thu, 10 Jul 2025 17:51:05 GMT
- Title: Defending Against Prompt Injection With a Few DefensiveTokens
- Authors: Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner,
- Abstract summary: Large language model (LLM) systems interact with external data to perform complex tasks.<n>By injecting instructions into the data accessed by the system, an attacker can override the initial user task with an arbitrary task directed by the attacker.<n>Test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner.<n>We propose DefensiveToken, a test-time defense with prompt injection comparable to training-time alternatives.
- Score: 53.7493897456957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When large language model (LLM) systems interact with external data to perform complex tasks, a new attack, namely prompt injection, becomes a significant threat. By injecting instructions into the data accessed by the system, the attacker is able to override the initial user task with an arbitrary task directed by the attacker. To secure the system, test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner. However, they are much less effective than training-time defenses that change the model parameters. Motivated by this, we propose DefensiveToken, a test-time defense with prompt injection robustness comparable to training-time alternatives. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can append a few DefensiveTokens before the LLM input to achieve security with a minimal utility drop. In scenarios where security is less of a concern, developers can simply skip DefensiveTokens; the LLM system remains the same as there is no defense, generating high-quality responses. Thus, DefensiveTokens, if released alongside the model, allow a flexible switch between the state-of-the-art (SOTA) utility and almost-SOTA security at test time. The code is available at https://github.com/Sizhe-Chen/DefensiveToken.
Related papers
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks [14.307668562901263]
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data.<n>We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks.
arXiv Detail & Related papers (2025-07-10T04:20:53Z) - To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt [5.8935359767204805]
We propose a novel, lightweight defense mechanism called Polymorphic Prompt Assembling.<n>The approach is based on the insight that prompt injection requires guessing and breaking the structure of the system prompt.<n> PPA prevents attackers from predicting the prompt structure, thereby enhancing security without compromising performance.
arXiv Detail & Related papers (2025-06-06T04:50:57Z) - One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research.<n>We propose textttD-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z) - LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts.<n>We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z) - Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.<n>As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise.<n>Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z) - FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks [45.65210717380502]
Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications.
prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire.
This paper introduces a novel test-time defense strategy, named AuThentication with Hash-based tags (FATH)
arXiv Detail & Related papers (2024-10-28T20:02:47Z) - SecAlign: Defending Against Prompt Injection with Preference Optimization [52.48001255555192]
Adversarial prompts can be injected into external data sources to override the system's intended instruction and execute a malicious instruction.<n>We propose a new defense called SecAlign based on the technique of preference optimization.<n>Our method reduces the success rates of various prompt injections to 10%, even against attacks much more sophisticated than ones seen during training.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.