Defense Against Prompt Injection Attack by Leveraging Attack Techniques
- URL: http://arxiv.org/abs/2411.00459v2
- Date: Mon, 23 Dec 2024 08:25:54 GMT
- Title: Defense Against Prompt Injection Attack by Leveraging Attack Techniques
- Authors: Yulin Chen, Haoran Li, Zihao Zheng, Yangqiu Song, Dekai Wu, Bryan Hooi,
- Abstract summary: Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.
As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise.
Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
- Score: 66.65466992544728
- License:
- Abstract: With the advancement of technology, large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, powering LLM-integrated applications like Microsoft Copilot. However, as LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. These attacks trick LLMs into deviating from the original input instructions and executing the attacker's instructions injected in data content, such as retrieved results. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content, and achieve a high attack success rate (ASR). When comparing the attack and defense methods, we interestingly find that they share similar design goals, of inducing the model to ignore unwanted instructions and instead to execute wanted instructions. Therefore, we raise an intuitive question: Could these attack techniques be utilized for defensive purposes? In this paper, we invert the intention of prompt injection methods to develop novel defense methods based on previous training-free attack methods, by repeating the attack process but with the original input instruction rather than the injected instruction. Our comprehensive experiments demonstrate that our defense techniques outperform existing training-free defense approaches, achieving state-of-the-art results.
Related papers
- Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks [45.65210717380502]
Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications.
prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire.
This paper introduces a novel test-time defense strategy, named AuThentication with Hash-based tags (FATH)
arXiv Detail & Related papers (2024-10-28T20:02:47Z) - SecAlign: Defending Against Prompt Injection with Preference Optimization [52.48001255555192]
Adrial prompts can be injected into external data sources to override the system's intended instruction and execute a malicious instruction.
We propose a new defense called SecAlign based on the technique of preference optimization.
Our method reduces the success rates of various prompt injections to around 0%, even against attacks much more sophisticated than ones seen during training.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Self-Evaluation as a Defense Against Adversarial Attacks on LLMs [20.79833694266861]
We introduce a defense against adversarial attacks on LLMs utilizing self-evaluation.
Our method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model.
We present an analysis of the effectiveness of our method, including attempts to attack the evaluator in various settings.
arXiv Detail & Related papers (2024-07-03T16:03:42Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks
Against LLM-Integrated Applications [0.0]
This paper introduces the 'Signed-Prompt' method as a novel solution for prompt injection attacks.
The study involves signing sensitive instructions within command segments by authorized users, enabling the LLM to discern trusted instruction sources.
Experiments demonstrate the effectiveness of the Signed-Prompt method, showing substantial resistance to various types of prompt injection attacks.
arXiv Detail & Related papers (2024-01-15T11:44:18Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.