An Early Categorization of Prompt Injection Attacks on Large Language
Models
- URL: http://arxiv.org/abs/2402.00898v1
- Date: Wed, 31 Jan 2024 19:52:00 GMT
- Title: An Early Categorization of Prompt Injection Attacks on Large Language
Models
- Authors: Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala and Jason
Bennett Thatcher
- Abstract summary: Large language models and AI chatbots have been at the forefront of democratizing artificial intelligence.
We are witnessing a cat-and-mouse game where users attempt to misuse the models with a novel attack called prompt injections.
In this paper, we provide an overview of these emergent threats and present a categorization of prompt injections.
- Score: 0.8875650122536799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models and AI chatbots have been at the forefront of
democratizing artificial intelligence. However, the releases of ChatGPT and
other similar tools have been followed by growing concerns regarding the
difficulty of controlling large language models and their outputs. Currently,
we are witnessing a cat-and-mouse game where users attempt to misuse the models
with a novel attack called prompt injections. In contrast, the developers
attempt to discover the vulnerabilities and block the attacks simultaneously.
In this paper, we provide an overview of these emergent threats and present a
categorization of prompt injections, which can guide future research on prompt
injections and act as a checklist of vulnerabilities in the development of LLM
interfaces. Moreover, based on previous literature and our own empirical
research, we discuss the implications of prompt injections to LLM end users,
developers, and researchers.
Related papers
- UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models [30.139590566956077]
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks.
We propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs.
arXiv Detail & Related papers (2025-02-18T18:59:00Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z) - Formalizing and Benchmarking Prompt Injection Attacks and Defenses [59.57908526441172]
We propose a framework to formalize prompt injection attacks.
Based on our framework, we design a new attack by combining existing ones.
Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
arXiv Detail & Related papers (2023-10-19T15:12:09Z) - Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications.
The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use.
We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z) - "That Is a Suspicious Reaction!": Interpreting Logits Variation to
Detect NLP Adversarial Attacks [0.2999888908665659]
Adversarial attacks are a major challenge faced by current machine learning research.
Our work presents a model-agnostic detector of adversarial text examples.
arXiv Detail & Related papers (2022-04-10T09:24:41Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - Bad Characters: Imperceptible NLP Attacks [16.357959724298745]
A class of adversarial examples can be used to attack text-based models in a black-box setting.
We find that with a single imperceptible encoding injection an attacker can significantly reduce the performance of vulnerable models.
Our attacks work against currently-deployed commercial systems, including those produced by Microsoft and Google.
arXiv Detail & Related papers (2021-06-18T03:42:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.