Related papers: An Early Categorization of Prompt Injection Attacks on Large Language Models

An Early Categorization of Prompt Injection Attacks on Large Language Models

URL: http://arxiv.org/abs/2402.00898v1
Date: Wed, 31 Jan 2024 19:52:00 GMT
Title: An Early Categorization of Prompt Injection Attacks on Large Language Models
Authors: Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala and Jason Bennett Thatcher
Abstract summary: Large language models and AI chatbots have been at the forefront of democratizing artificial intelligence. We are witnessing a cat-and-mouse game where users attempt to misuse the models with a novel attack called prompt injections. In this paper, we provide an overview of these emergent threats and present a categorization of prompt injections.
Score: 0.8875650122536799
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models and AI chatbots have been at the forefront of democratizing artificial intelligence. However, the releases of ChatGPT and other similar tools have been followed by growing concerns regarding the difficulty of controlling large language models and their outputs. Currently, we are witnessing a cat-and-mouse game where users attempt to misuse the models with a novel attack called prompt injections. In contrast, the developers attempt to discover the vulnerabilities and block the attacks simultaneously. In this paper, we provide an overview of these emergent threats and present a categorization of prompt injections, which can guide future research on prompt injections and act as a checklist of vulnerabilities in the development of LLM interfaces. Moreover, based on previous literature and our own empirical research, we discuss the implications of prompt injections to LLM end users, developers, and researchers.

Related papers

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks [36.97842000562324]
A benchmark called WASP introduces realistic web agent hijacking objectives and an isolated environment to test them. Our evaluation shows that even AI agents backed by models with advanced reasoning capabilities are susceptible to low-effort human-written prompt injections. Agents begin executing the adversarial instruction between 16 and 86% of the time but only achieve the goal between 0 and 17% of the time.
arXiv Detail & Related papers (2025-04-22T17:51:03Z)
Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via Lightweight Prompt Injection [12.565784666173277]
This report presents a real-world case study demonstrating how prompt injection can attack large language model platforms such as ChatGPT. We show how adversarial prompts can be injected via user inputs, web-based retrieval, and system-level agent instructions.
arXiv Detail & Related papers (2025-04-20T05:59:00Z)
UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models [30.139590566956077]
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks. We propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs.
arXiv Detail & Related papers (2025-02-18T18:59:00Z)
Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context [49.13497493053742]
We focus on human-readable adversarial prompts, which are more realistic and potent threats. Our key contributions are (1) situation-driven attacks leveraging movie scripts as context to create human-readable prompts that successfully deceive LLMs, (2) adversarial suffix conversion to transform nonsensical adversarial suffixes into independent meaningful text, and (3) AdvPrompter with p-nucleus sampling, a method to generate diverse, human-readable adversarial suffixes.
arXiv Detail & Related papers (2024-12-20T21:43:52Z)
Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z)
Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness. The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
Automatic and Universal Prompt Injection Attacks against Large Language Models [38.694912482525446]
Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. These attacks manipulate applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data.
arXiv Detail & Related papers (2024-03-07T23:46:20Z)
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)
Formalizing and Benchmarking Prompt Injection Attacks and Defenses [59.57908526441172]
We propose a framework to formalize prompt injection attacks. Based on our framework, we design a new attack by combining existing ones. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
arXiv Detail & Related papers (2023-10-19T15:12:09Z)
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications. We show how attackers can override original instructions and employed controls using Prompt Injection attacks. We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers. Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks [0.2999888908665659]
Adversarial attacks are a major challenge faced by current machine learning research. Our work presents a model-agnostic detector of adversarial text examples.
arXiv Detail & Related papers (2022-04-10T09:24:41Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
Bad Characters: Imperceptible NLP Attacks [16.357959724298745]
A class of adversarial examples can be used to attack text-based models in a black-box setting. We find that with a single imperceptible encoding injection an attacker can significantly reduce the performance of vulnerable models. Our attacks work against currently-deployed commercial systems, including those produced by Microsoft and Google.
arXiv Detail & Related papers (2021-06-18T03:42:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.