Related papers: Maatphor: Automated Variant Analysis for Prompt Injection Attacks

Maatphor: Automated Variant Analysis for Prompt Injection Attacks

URL: http://arxiv.org/abs/2312.11513v1
Date: Tue, 12 Dec 2023 14:22:20 GMT
Title: Maatphor: Automated Variant Analysis for Prompt Injection Attacks
Authors: Ahmed Salem and Andrew Paverd and Boris K\"opf
Abstract summary: Current best-practice for defending against prompt injection techniques is to add additional guardrails to the system. We present a tool to assist defenders in performing automated variant analysis of known prompt injection attacks.
Score: 7.93367270029538
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prompt injection has emerged as a serious security threat to large language models (LLMs). At present, the current best-practice for defending against newly-discovered prompt injection techniques is to add additional guardrails to the system (e.g., by updating the system prompt or using classifiers on the input and/or output of the model.) However, in the same way that variants of a piece of malware are created to evade anti-virus software, variants of a prompt injection can be created to evade the LLM's guardrails. Ideally, when a new prompt injection technique is discovered, candidate defenses should be tested not only against the successful prompt injection, but also against possible variants. In this work, we present, a tool to assist defenders in performing automated variant analysis of known prompt injection attacks. This involves solving two main challenges: (1) automatically generating variants of a given prompt according, and (2) automatically determining whether a variant was effective based only on the output of the model. This tool can also assist in generating datasets for jailbreak and prompt injection attacks, thus overcoming the scarcity of data in this domain. We evaluate Maatphor on three different types of prompt injection tasks. Starting from an ineffective (0%) seed prompt, Maatphor consistently generates variants that are at least 60% effective within the first 40 iterations.

Related papers

CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks [47.62236306990252]
Large Language Models (LLMs) are susceptible to indirect prompt injection attacks. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune that defends against this attack by identifying and pruning task-triggering neurons.
arXiv Detail & Related papers (2025-04-29T23:42:21Z)
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks [101.52204404377039]
LLM-integrated applications and agents are vulnerable to prompt injection attacks. A detection method aims to determine whether a given input is contaminated by an injected prompt. We propose DataSentinel, a game-theoretic method to detect prompt injection attacks.
arXiv Detail & Related papers (2025-04-15T16:26:21Z)
Can Indirect Prompt Injection Attacks Be Detected and Removed? [68.6543680065379]
We investigate the feasibility of detecting and removing indirect prompt injection attacks. For detection, we assess the performance of existing LLMs and open-source detection models. For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.
arXiv Detail & Related papers (2025-02-23T14:02:16Z)
Prompt Inject Detection with Generative Explanation as an Investigative Tool [0.0]
Large Language Models (LLMs) are vulnerable to adversarial prompt based injects. This research explores the use of a text generation capabilities of LLM to detect prompt injects.
arXiv Detail & Related papers (2025-02-16T06:16:00Z)
MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks. We present MELON, a novel IPI defense. We show that MELON outperforms SOTA defenses in both attack prevention and utility preservation.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks. As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks. We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction. We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks [45.65210717380502]
Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications. prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire. This paper introduces a novel test-time defense strategy, named AuThentication with Hash-based tags (FATH)
arXiv Detail & Related papers (2024-10-28T20:02:47Z)
Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection [6.269725911814401]
Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks. However, LLMs applications are highly vulnerable to prompt injection attacks, which poses a critical problem. This project explores the security vulnerabilities in relation to prompt injection attacks.
arXiv Detail & Related papers (2024-10-28T00:36:21Z)
Automatic and Universal Prompt Injection Attacks against Large Language Models [38.694912482525446]
Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. These attacks manipulate applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data.
arXiv Detail & Related papers (2024-03-07T23:46:20Z)
Formalizing and Benchmarking Prompt Injection Attacks and Defenses [59.57908526441172]
We propose a framework to formalize prompt injection attacks. Based on our framework, we design a new attack by combining existing ones. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
arXiv Detail & Related papers (2023-10-19T15:12:09Z)
Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models [41.1058288041033]
We propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt. Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack.
arXiv Detail & Related papers (2023-05-02T06:19:36Z)
Versatile Weight Attack via Flipping Limited Bits [68.45224286690932]
We study a novel attack paradigm, which modifies model parameters in the deployment stage. Considering the effectiveness and stealthiness goals, we provide a general formulation to perform the bit-flip based weight attack. We present two cases of the general formulation with different malicious purposes, i.e., single sample attack (SSA) and triggered samples attack (TSA)
arXiv Detail & Related papers (2022-07-25T03:24:58Z)
Evolutionary Multi-Task Injection Testing on Web Application Firewalls [11.037455973709532]
DaNuoYi is an automatic injection testing tool that simultaneously generates test inputs for multiple types of injection attacks on a WAF. We conduct experiments on three real-world open-source WAFs and six types of injection attacks. DaNuoYi generates up to 3.8x and 5.78x more valid test inputs (i.e., bypassing the underlying WAF) than its state-of-the-art single-task counterparts.
arXiv Detail & Related papers (2022-06-12T14:11:55Z)
Composite Adversarial Attacks [57.293211764569996]
Adversarial attack is a technique for deceiving Machine Learning (ML) models. In this paper, a new procedure called Composite Adrial Attack (CAA) is proposed for automatically searching the best combination of attack algorithms. CAA beats 10 top attackers on 11 diverse defenses with less elapsed time.
arXiv Detail & Related papers (2020-12-10T03:21:16Z)
Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes. We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks. These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.